The source of the data is the following link: LINK
There are four files:
The following is a brief summary of the data cleaning steps we performed:
The data tables contain information regarding the building address, location, service number, billing dates, and total amount due.
Description of each column
Statistical Data Type of Each Column
The key issue in generating electricity is to determine how much capacity to generate in order to meet future demand.
Electricity usage forecasting involves predicting the demand for electricity over a specific period. This process has several uses, including energy procurement, where it helps suppliers purchase the right amount of energy to ensure a steady supply.
The advancement of smart infrastructure and the integration of distributed renewable power have increased uncertainty in future supply, demand, and pricing. This unpredictability has increased interest in price prediction and energy analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from scipy import stats
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.metrics import accuracy_score, f1_score
import requests, urllib, os, pickle
from tqdm import tqdm
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler, MinMaxScaler, normalize
from sklearn.decomposition import PCA
import scipy.cluster.hierarchy as shc
from IPython import display
from sklearn.metrics import mean_absolute_error
from pmdarima.arima import auto_arima
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.varmax import VARMAX
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import grangercausalitytests, adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
import statsmodels.api as sm
from tqdm import tqdm_notebook
from itertools import product
import math
from statistics import mean
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.preprocessing.sequence import TimeseriesGenerator
pd.options.display.max_columns=25
import warnings
warnings.filterwarnings('ignore')
data_2012 = pd.read_excel(
'houston-houston-electricity-bills/coh-fy2012-ee-bills-july2011-june2012.xls'
)
orig_shape_2012 = data_2012.shape[0]
data_2012.shape
(57430, 24)
data_2012.head(5)
|   | Reliant Contract No | Service Address | Meter No | ESID | Business Area | Cost Center | Fund | Bill Type | Bill Date | Read Date | Due Date | Meter Read | Base Cost ($) | T&D Discretionary ($) | T&D Charges ($) | Current Due ($) | Adjustment ($) | Total Due ($) | Franchise Fee ($) | Voucher Date | Billed Demand | kWh Usage | Nodal Cu Charge ($) | Reliability Unit Charge ($) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2012-06-26 | 2012-06-21 | 2012-07-26 | 47940.0 | 61070.65 | 1638.01 | 10440.86 | 73232.11 | NaN | 73232.11 | -1047.28 | 2012-06-27 | 1507.291667 | 905421 | 82.59 | 0.0 |
| 1 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2012-05-25 | 2012-05-21 | 2012-06-24 | 47186.0 | 56319.47 | 1631.00 | 10364.63 | 68463.46 | NaN | 68463.46 | -1045.21 | 2012-05-30 | 1496.907217 | 824107 | 148.36 | 0.0 |
| 2 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2012-04-27 | 2012-04-23 | 2012-05-27 | 46499.0 | 68461.63 | 1674.67 | 10676.79 | 80847.87 | NaN | 80847.87 | -1081.11 | 2012-04-30 | 1562.500000 | 977744 | 34.78 | 0.0 |
| 3 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2012-03-27 | 2012-03-21 | 2012-04-26 | 45684.0 | 62036.29 | 1696.66 | 10681.48 | 74373.93 | NaN | 74373.93 | -1087.32 | 2012-03-28 | 1567.708333 | 876838 | -40.50 | 0.0 |
| 4 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2012-02-27 | 2012-02-21 | 2012-03-28 | 44954.0 | 61670.24 | 1703.80 | 10707.94 | 74080.27 | NaN | 74080.27 | -1090.08 | 2012-02-28 | 1577.083333 | 872898 | -1.71 | 0.0 |
data_2012.isna().sum()
Reliant Contract No 0 Service Address 0 Meter No 7809 ESID 0 Business Area 0 Cost Center 0 Fund 0 Bill Type 0 Bill Date 0 Read Date 0 Due Date 0 Meter Read 2 Base Cost ($) 0 T&D Discretionary ($) 0 T&D Charges ($) 0 Current Due ($) 0 Adjustment ($) 56259 Total Due ($) 0 Franchise Fee ($) 0 Voucher Date 0 Billed Demand 3 kWh Usage 0 Nodal Cu Charge ($) 1 Reliability Unit Charge ($) 4 dtype: int64
data_2012['Adjustment ($)'].value_counts(dropna=False)
NaN 56259 0.0 1170 9425.9 1 Name: Adjustment ($), dtype: int64
The column does not have any relevant information based on the above reported values. Electing to drop the column.
data_2012.drop(columns=['Adjustment ($)'], inplace=True)
There are quite a few columns in the dataset that identify a unique person/house/business. Checking the unique counts of such columns.
check_unique_columns = ['Reliant Contract No', 'Service Address ', 'Meter No',
'ESID', 'Business Area', 'Cost Center',]
for col in check_unique_columns:
print(f'Number of Unique Values in {col}: {data_2012[col].nunique()}')
Number of Unique Values in Reliant Contract No: 5241 Number of Unique Values in Service Address : 5183 Number of Unique Values in Meter No: 4021 Number of Unique Values in ESID: 5241 Number of Unique Values in Business Area: 9 Number of Unique Values in Cost Center: 38
Based on the above reported values and further research online:
ESID signifies a unique ID assigned to each customer subscribed to the electricity board. It is best to keep the ESID and Service Address columns going forward, as these give the number of unique customers and the areas (streets) where electricity usage is higher.
Business Area signifies a grouping of buildings that covers a certain area. This is useful for analyzing usage patterns grouped by zones within the city.
data_2012['Bill Type'].value_counts(dropna=False)
T 56859 P 552 C 19 Name: Bill Type, dtype: int64
Bill Type could signify the type of connection provided. Since commercial, residential, and government spaces have different pricing and needs, this column could be capturing that information.
(
data_2012['Service Address '].nunique(),
data_2012['Meter No'].nunique(),
data_2012['ESID'].nunique()
)
(5183, 4021, 5241)
The next 3 columns are: Bill Date, Read Date and Due Date. Of these it would be best to choose the Bill date across all the data files to keep the data consistent.
data_2012[['Meter Read', 'Billed Demand ', 'kWh Usage']].describe()
|   | Meter Read | Billed Demand | kWh Usage |
|---|---|---|---|
| count | 57428.000000 | 57427.000000 | 5.743000e+04 |
| mean | 10008.024135 | 52.581303 | 2.249732e+04 |
| std | 19208.052944 | 432.027165 | 2.216349e+05 |
| min | 0.000000 | 0.000000 | 0.000000e+00 |
| 25% | 118.750000 | 0.000000 | 1.000000e+02 |
| 50% | 2583.000000 | 0.000000 | 2.980000e+02 |
| 75% | 7879.000000 | 11.000000 | 2.240000e+03 |
| max | 342348.000000 | 18495.555556 | 1.069344e+07 |
There are 3 columns that denote the amount of electricity: Meter Read, Billed Demand, kWh Usage.
Using kWh Usage as a standard unit of measurement.
data_2012[[
'Base Cost ($)', 'T&D Discretionary ($)', 'T&D Charges ($)',
'Current Due ($)', 'Total Due ($)', 'Franchise Fee ($)',
'Nodal Cu Charge ($)', 'Reliability Unit Charge ($)'
]].describe()
|   | Base Cost ($) | T&D Discretionary ($) | T&D Charges ($) | Current Due ($) | Total Due ($) | Franchise Fee ($) | Nodal Cu Charge ($) | Reliability Unit Charge ($) |
|---|---|---|---|---|---|---|---|---|
| count | 57430.000000 | 57430.000000 | 57430.000000 | 57430.000000 | 57430.000000 | 57430.000000 | 57429.000000 | 57426.0 |
| mean | 1557.590034 | 404.377159 | 322.324780 | 2292.520167 | 2326.005266 | -36.249975 | 8.067123 | 0.0 |
| std | 15332.140262 | 12617.605024 | 2103.325682 | 23457.157709 | 23484.415824 | 255.356787 | 136.268511 | 0.0 |
| min | 0.000000 | -44.990000 | -680.340000 | -64.210000 | 0.000000 | -9352.010000 | -367.210000 | 0.0 |
| 25% | 6.870000 | 3.240000 | 7.380000 | 18.650000 | 18.430000 | -5.740000 | 0.000000 | 0.0 |
| 50% | 20.590000 | 3.910000 | 12.440000 | 38.240000 | 38.490000 | -0.500000 | 0.010000 | 0.0 |
| 75% | 155.252500 | 17.070000 | 98.847500 | 312.610000 | 317.212500 | 0.000000 | 0.280000 | 0.0 |
| max | 740473.960000 | 754326.010000 | 64282.330000 | 907483.660000 | 907483.660000 | 0.000000 | 18019.450000 | 0.0 |
Reliability Unit Charge does not contain any useful information. Electing to drop that column.
The other charge columns are components that add up to the amounts in Current Due and Total Due. Based on the above statistics, Current Due ($) and Total Due ($) represent essentially the same value, so going forward we use Total Due ($).
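As a quick sanity check of this claim (a sketch, not part of the original cleaning; column names are taken from the table above, and minor dropped charges such as Adjustment are ignored), one could compare Current Due with Total Due and with the sum of the component charges:
# Sketch: how often do Current Due and Total Due differ, and how close is the
# sum of the component charges to Current Due?
components = ['Base Cost ($)', 'T&D Discretionary ($)', 'T&D Charges ($)',
              'Franchise Fee ($)', 'Nodal Cu Charge ($)']
due_diff = (data_2012['Current Due ($)'] - data_2012['Total Due ($)']).abs()
print('Rows where Current Due differs from Total Due:', (due_diff > 0.01).sum())
comp_diff = (data_2012[components].sum(axis=1) - data_2012['Current Due ($)']).abs()
print('Median absolute gap between component sum and Current Due:', comp_diff.median())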
data_2012.columns
Index(['Reliant Contract No', 'Service Address ', 'Meter No', 'ESID',
'Business Area', 'Cost Center', 'Fund', 'Bill Type', 'Bill Date',
'Read Date', 'Due Date', 'Meter Read', 'Base Cost ($)',
'T&D Discretionary ($)', 'T&D Charges ($)', 'Current Due ($)',
'Total Due ($)', 'Franchise Fee ($)', 'Voucher Date', 'Billed Demand ',
'kWh Usage', 'Nodal Cu Charge ($)', 'Reliability Unit Charge ($)'],
dtype='object')
Based on the above analysis of the dataset choosing the following columns:
data_2012 = data_2012[[
'ESID', 'Business Area', 'Service Address ', 'Bill Type',
'Bill Date', 'Total Due ($)', 'kWh Usage'
]]
rename_cols = {
'ESID': 'esid',
'Business Area': 'business_area',
'Service Address ': 'service_address',
'Bill Type': 'bill_type',
'Bill Date': 'bill_date',
'Total Due ($)': 'total_due',
'kWh Usage': 'kwh_usage'
}
data_2012_main = data_2012.rename(columns=rename_cols)
Checking for Nulls again and dtypes
data_2012_main.isna().sum()
esid 0 business_area 0 service_address 0 bill_type 0 bill_date 0 total_due 0 kwh_usage 0 dtype: int64
data_2012_main.dtypes
esid object business_area int64 service_address object bill_type object bill_date datetime64[ns] total_due float64 kwh_usage int64 dtype: object
data_2012_main.shape
(57430, 7)
zscore_2012 = stats.zscore(data_2012_main[['total_due', 'kwh_usage']])
zscore_2012
|   | total_due | kwh_usage |
|---|---|---|
| 0 | 3.019310 | 3.983720 |
| 1 | 2.816252 | 3.616835 |
| 2 | 3.343602 | 4.310039 |
| 3 | 3.067930 | 3.854755 |
| 4 | 3.055426 | 3.836978 |
| ... | ... | ... |
| 57425 | -0.070053 | -0.090029 |
| 57426 | -0.070059 | -0.090029 |
| 57427 | -0.070064 | -0.090029 |
| 57428 | -0.070255 | -0.090029 |
| 57429 | -0.011686 | -0.090029 |
57430 rows × 2 columns
Each z-score signifies how many standard deviations an individual value is from the mean, which makes it a good indicator for finding outliers in the dataframe.
Usually a z-score of 3 is taken as the cut-off, so any value with a z-score greater than +3 or less than -3 is considered an outlier; this is essentially the same as the standard-deviation method.
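A minimal sketch of how this cut-off could be applied (mirroring the commented-out filter just below), counting the flagged rows instead of dropping them:
# Sketch: flag rows whose z-score magnitude is 3 or more in either column.
outlier_mask_2012 = ~(np.abs(zscore_2012) < 3).all(axis=1)
print('Rows flagged as outliers:', outlier_mask_2012.sum())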
# data_2012_main = data_2012_main[(np.abs(zscore_2012) < 3).all(axis=1)]
data_2012_main.shape
(57430, 7)
Since the filtering step above is left commented out, the shape is unchanged at 57,430 rows; applying the z-score filter would remove 405 outlier rows and leave 57,025.
data_2012_main.head(5)
|   | esid | business_area | service_address | bill_type | bill_date | total_due | kwh_usage |
|---|---|---|---|---|---|---|---|
| 0 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2012-06-26 | 73232.11 | 905421 |
| 1 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2012-05-25 | 68463.46 | 824107 |
| 2 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2012-04-27 | 80847.87 | 977744 |
| 3 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2012-03-27 | 74373.93 | 876838 |
| 4 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2012-02-27 | 74080.27 | 872898 |
orig_shape_2012 - data_2012_main.shape[0]
0
data_2012_main.to_csv('electricity_usage_data_2012.csv', index=False)
The trend graphs of the cost and the energy usage are the same, since cost = energy usage × cost per unit.
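As a rough illustration of that relationship (a sketch, not part of the original notebook), the implied unit price can be computed directly; if the statement holds, its spread should be fairly narrow:
# Sketch: implied price per kWh for rows with non-zero usage.
nonzero = data_2012_main['kwh_usage'] > 0
unit_price = data_2012_main.loc[nonzero, 'total_due'] / data_2012_main.loc[nonzero, 'kwh_usage']
print(unit_price.describe())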
The code for the cleaning performed in this section is in the IPYNB: 'July 2012 to June 2013.ipynb'
data_2013 = pd.read_excel(
'houston-houston-electricity-bills/coh-fy2013-ee-bills-july2012-june2013.xlsx'
)
orig_shape_2013 = data_2013.shape[0]
data_2013.shape
(66776, 24)
data_2013.head(5)
|   | Reliant Contract No | Service Address | Meter No | ESID | Business Area | Cost Center | Fund | Bill Type | Bill Date | Read Date | Due Date | Meter Read | Base Cost ($) | T&D Discretionary ($) | T&D Charges ($) | Current Due ($) | Index Charge ($) | Total Due ($) | Franchise Fee ($) | Voucher Date | Billed Demand (KVA) | kWh Usage | Nodal Cu Charge ($) | Adder Charge ($) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2013-06-27 | 2013-06-23 | 2013-07-28 | 57061.0 | 57621.95 | 1319.11 | 10606.19 | 69785.20 | NaN | 69785.20 | -1016.90 | 2013-07-01 | 1462.500000 | 876113.0 | 237.95 | 0.0 |
| 1 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2013-05-29 | 2013-05-23 | 2013-06-28 | 56331.0 | 57981.59 | 1316.35 | 10676.66 | 70177.01 | NaN | 70177.01 | -1041.76 | 2013-05-30 | 1496.907217 | 879842.0 | 202.41 | 0.0 |
| 2 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2013-04-29 | 2013-04-23 | 2013-05-29 | 55598.0 | 67005.80 | 1357.92 | 10853.62 | 79309.40 | NaN | 79309.40 | -1036.92 | 2013-04-30 | 1502.083333 | 997407.0 | 92.06 | 0.0 |
| 3 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2013-03-28 | 2013-03-21 | 2013-04-27 | 54767.0 | 57008.44 | 1300.83 | 10473.16 | 68778.30 | NaN | 68778.30 | -995.50 | 2013-03-29 | 1432.989691 | 849351.0 | -4.13 | 0.0 |
| 4 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2013-02-26 | 2013-02-21 | 2013-03-28 | 54059.0 | 61287.57 | 1313.49 | 10534.33 | 73135.18 | NaN | 73135.18 | -1000.33 | 2013-02-27 | 1452.577320 | 911746.0 | -0.21 | 0.0 |
data_2013.isna().sum()
Reliant Contract No 0 Service Address 0 Meter No 15228 ESID 0 Business Area 0 Cost Center 0 Fund 0 Bill Type 0 Bill Date 0 Read Date 0 Due Date 0 Meter Read 12 Base Cost ($) 1 T&D Discretionary ($) 0 T&D Charges ($) 1 Current Due ($) 0 Index Charge ($) 65183 Total Due ($) 0 Franchise Fee ($) 0 Voucher Date 0 Billed Demand (KVA) 12 kWh Usage 1 Nodal Cu Charge ($) 5 Adder Charge ($) 14 dtype: int64
data_2013['Index Charge ($)'].value_counts(dropna=False)
NaN 65183 0.00 1592 -0.54 1 Name: Index Charge ($), dtype: int64
The column does not have any relevant information based on the above reported values. Electing to drop the column.
data_2013.drop(columns=['Index Charge ($)'], inplace=True)
There are quite a few columns in the dataset that identify a unique person/house/business. Checking the unique counts of such columns.
check_unique_columns = ['Reliant Contract No', 'Service Address ', 'Meter No',
'ESID', 'Business Area', 'Cost Center',]
for col in check_unique_columns:
print(f'Number of Unique Values in {col}: {data_2013[col].nunique()}')
Number of Unique Values in Reliant Contract No: 5900 Number of Unique Values in Service Address : 5840 Number of Unique Values in Meter No: 4035 Number of Unique Values in ESID: 5898 Number of Unique Values in Business Area: 9 Number of Unique Values in Cost Center: 39
Based on the above reported values and further research online:
ESID signifies a unique ID assigned to each customer subscribed to the electricity board. It is best to keep the ESID and Service Address columns going forward, as these give the number of unique customers and the areas (streets) where electricity usage is higher.
Business Area signifies a grouping of buildings that covers a certain area. This is useful for analyzing usage patterns grouped by zones within the city.
data_2013['Bill Type'].value_counts(dropna=False)
T 66222 P 552 C 2 Name: Bill Type, dtype: int64
Bill Type could signify the type of connection provided. Since commercial, residential, and government spaces have different pricing and needs, this column could be capturing that information.
(
data_2013['Service Address '].nunique(),
data_2013['Meter No'].nunique(),
data_2013['ESID'].nunique()
)
(5840, 4035, 5898)
The next 3 columns are: Bill Date, Read Date and Due Date. Of these it would be best to choose the Bill date across all the data files to keep the data consistent.
data_2013[['Meter Read', 'Billed Demand (KVA)', 'kWh Usage']].describe()
|   | Meter Read | Billed Demand (KVA) | kWh Usage |
|---|---|---|---|
| count | 66764.000000 | 66764.000000 | 6.677500e+04 |
| mean | 9869.779829 | 44.208272 | 1.880421e+04 |
| std | 17911.694906 | 380.343991 | 2.024587e+05 |
| min | 0.000000 | 0.000000 | 0.000000e+00 |
| 25% | 0.000000 | 0.000000 | 1.000000e+00 |
| 50% | 3123.000000 | 0.000000 | 2.310000e+02 |
| 75% | 9007.250000 | 8.000000 | 1.680000e+03 |
| max | 239800.000000 | 16775.903614 | 9.689658e+06 |
There are 3 columns that denote the amount of electricity: Meter Read, Billed Demand, kWh Usage.
Using kWh Usage as a standard unit of measurement.
data_2013[[
'Base Cost ($)', 'T&D Discretionary ($)', 'T&D Charges ($)',
'Current Due ($)', 'Total Due ($)', 'Franchise Fee ($)',
'Nodal Cu Charge ($)', 'Adder Charge ($)'
]].describe()
|   | Base Cost ($) | T&D Discretionary ($) | T&D Charges ($) | Current Due ($) | Total Due ($) | Franchise Fee ($) | Nodal Cu Charge ($) | Adder Charge ($) |
|---|---|---|---|---|---|---|---|---|
| count | 66775.000000 | 66776.000000 | 66775.000000 | 66776.000000 | 66776.000000 | 66776.000000 | 66771.000000 | 66762.0 |
| mean | 1249.628836 | 367.439382 | 278.533215 | 1901.861997 | 1902.580866 | -33.921297 | 6.005230 | 0.0 |
| std | 13443.314342 | 11796.148872 | 1997.001709 | 21320.228167 | 21320.836910 | 237.409585 | 132.671939 | 0.0 |
| min | 0.000000 | -7091.410000 | -37666.730000 | -44264.860000 | 0.000000 | -7017.800000 | -323.080000 | 0.0 |
| 25% | 0.070000 | 3.120000 | 6.020000 | 11.820000 | 11.820000 | -5.410000 | 0.000000 | 0.0 |
| 50% | 15.360000 | 6.200000 | 10.940000 | 32.170000 | 32.040000 | -0.460000 | 0.000000 | 0.0 |
| 75% | 111.690000 | 20.730000 | 77.865000 | 234.010000 | 230.737500 | 0.000000 | 0.150000 | 0.0 |
| max | 650951.220000 | 756478.120000 | 69826.360000 | 907001.560000 | 907001.560000 | 84.910000 | 20461.930000 | 0.0 |
Adder Charge ($) does not contain any useful information. Electing to drop that column. Previously this column was Reliability Unit Charge.
The other charge columns are components that add up to the amounts in Current Due and Total Due. Based on the above statistics, Current Due ($) and Total Due ($) represent essentially the same value, so going forward we use Total Due ($).
Based on the above analysis of the dataset choosing the following columns:
data_2013 = data_2013[[
'ESID', 'Business Area', 'Service Address ', 'Bill Type',
'Bill Date', 'Total Due ($)', 'kWh Usage'
]]
rename_cols = {
'ESID': 'esid',
'Business Area': 'business_area',
'Service Address ': 'service_address',
'Bill Type': 'bill_type',
'Bill Date': 'bill_date',
'Total Due ($)': 'total_due',
'kWh Usage': 'kwh_usage'
}
data_2013_main = data_2013.rename(columns=rename_cols)
Checking for Nulls again and dtypes
data_2013_main.isna().sum()
esid 0 business_area 0 service_address 0 bill_type 0 bill_date 0 total_due 0 kwh_usage 1 dtype: int64
data_2013_main.dropna(subset=['kwh_usage'], inplace=True)
data_2013_main.isna().sum()
esid 0 business_area 0 service_address 0 bill_type 0 bill_date 0 total_due 0 kwh_usage 0 dtype: int64
data_2013_main.dtypes
esid object business_area int64 service_address object bill_type object bill_date datetime64[ns] total_due float64 kwh_usage float64 dtype: object
data_2013_main.shape
(66775, 7)
zscore_2013 = stats.zscore(data_2013_main[['total_due', 'kwh_usage']])
zscore_2013
|   | total_due | kwh_usage |
|---|---|---|
| 0 | 3.183862 | 4.234519 |
| 1 | 3.202239 | 4.252938 |
| 2 | 3.630570 | 4.833629 |
| 3 | 3.136636 | 4.102333 |
| 4 | 3.340984 | 4.410522 |
| ... | ... | ... |
| 66771 | -0.057672 | -0.080314 |
| 66772 | -0.057229 | -0.080314 |
| 66773 | -0.057217 | -0.080314 |
| 66774 | -0.057477 | -0.080314 |
| 66775 | -0.056839 | -0.080314 |
66775 rows × 2 columns
Each z-score signifies how many standard deviations an individual value is from the mean, which makes it a good indicator for finding outliers in the dataframe.
Usually a z-score of 3 is taken as the cut-off, so any value with a z-score greater than +3 or less than -3 is considered an outlier; this is essentially the same as the standard-deviation method.
# data_2013_main = data_2013_main[(np.abs(zscore_2013) < 3).all(axis=1)]
data_2013_main.shape
(66775, 7)
Since the filtering step above is left commented out, the shape is unchanged at 66,775 rows; applying the z-score filter would remove 415 outlier rows and leave 66,360.
data_2013_main.head(5)
|   | esid | business_area | service_address | bill_type | bill_date | total_due | kwh_usage |
|---|---|---|---|---|---|---|---|
| 0 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2013-06-27 | 69785.20 | 876113.0 |
| 1 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2013-05-29 | 70177.01 | 879842.0 |
| 2 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2013-04-29 | 79309.40 | 997407.0 |
| 3 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2013-03-28 | 68778.30 | 849351.0 |
| 4 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2013-02-26 | 73135.18 | 911746.0 |
orig_shape_2013 - data_2013_main.shape[0]
1
data_2013_main.to_csv('electricity_usage_data_2013.csv', index=False)
data_2013_2 = pd.read_excel(
'houston-houston-electricity-bills/coh-ee-bills-may2012-apr2013.xlsx'
)
orig_shape_2013_2 = data_2013_2.shape[0]
data_2013_2.shape
(65806, 24)
data_2013_2.head(5)
|   | Reliant Contract No | Service Address | Meter No | ESID | Business Area | Cost Center | Fund | Bill Type | Bill Date | Read Date | Due Date | Meter Read | Base Cost ($) | T&D Discretionary ($) | T&D Charges ($) | Current Due ($) | Adjustment ($) | Total Due ($) | Franchise Fee ($) | Voucher Date | Billed Demand (KVA) | kWh Usage | Nodal Cu Charge ($) | Reliability Unit Charge ($) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2013-04-29 | 2013-04-23 | 2013-05-29 | 55598.0 | 67005.80 | 1357.92 | 10853.62 | 79309.40 | NaN | 79309.40 | -1036.92 | 2013-04-30 | 1502.083333 | 997407.0 | 92.06 | 0.0 |
| 1 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2013-03-28 | 2013-03-21 | 2013-04-27 | 54767.0 | 57008.44 | 1300.83 | 10473.16 | 68778.30 | NaN | 68778.30 | -995.50 | 2013-03-29 | 1432.989691 | 849351.0 | -4.13 | 0.0 |
| 2 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2013-02-26 | 2013-02-21 | 2013-03-28 | 54059.0 | 61287.57 | 1313.49 | 10534.33 | 73135.18 | NaN | 73135.18 | -1000.33 | 2013-02-27 | 1452.577320 | 911746.0 | -0.21 | 0.0 |
| 3 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2013-01-28 | 2013-01-22 | 2013-02-27 | 53299.0 | 64657.23 | 1369.20 | 10878.65 | 77043.90 | NaN | 77043.90 | -1048.66 | 2013-01-29 | 1498.969072 | 969810.0 | 138.82 | 0.0 |
| 4 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2012-12-28 | 2012-12-20 | 2013-01-27 | 52491.0 | 60872.54 | 1612.61 | 10928.60 | 73740.34 | NaN | 73740.34 | -1088.01 | 2012-12-31 | 1572.916667 | 927935.0 | 326.59 | 0.0 |
data_2013_2.isna().sum()
Reliant Contract No 0 Service Address 0 Meter No 14049 ESID 0 Business Area 0 Cost Center 0 Fund 0 Bill Type 0 Bill Date 0 Read Date 0 Due Date 0 Meter Read 16 Base Cost ($) 0 T&D Discretionary ($) 0 T&D Charges ($) 0 Current Due ($) 0 Adjustment ($) 64229 Total Due ($) 0 Franchise Fee ($) 0 Voucher Date 0 Billed Demand (KVA) 16 kWh Usage 1 Nodal Cu Charge ($) 4 Reliability Unit Charge ($) 16 dtype: int64
The Adjustment ($) column checked below was named Index Charge ($) in the other FY 2013 electricity usage data file.
data_2013_2['Adjustment ($)'].value_counts(dropna=False)
NaN 64229 0.0 1576 9425.9 1 Name: Adjustment ($), dtype: int64
The column does not have any relevant information based on the above reported values. Electing to drop the column.
data_2013_2.drop(columns=['Adjustment ($)'], inplace=True)
There are quite a few columns in the dataset that identify a unique person/house/business. Checking the unique counts of such columns.
check_unique_columns = [
'Reliant Contract No', 'Service Address ', 'Meter No',
'ESID', 'Business Area', 'Cost Center',
]
for col in check_unique_columns:
print(f'Number of Unique Values in {col}: {data_2013_2[col].nunique()}')
Number of Unique Values in Reliant Contract No: 5786 Number of Unique Values in Service Address : 5725 Number of Unique Values in Meter No: 4035 Number of Unique Values in ESID: 5783 Number of Unique Values in Business Area: 9 Number of Unique Values in Cost Center: 39
Based on the above reported values and further research online:
ESID signifies a unique ID assigned to each customer subscribed to the electricity board. It is best to keep the ESID and Service Address columns going forward, as these give the number of unique customers and the areas (streets) where electricity usage is higher.
Business Area signifies a grouping of buildings that covers a certain area. This is useful for analyzing usage patterns grouped by zones within the city.
data_2013_2['Bill Type'].value_counts(dropna=False)
T 65252 P 552 C 2 Name: Bill Type, dtype: int64
Bill Type could signify the type of connection provided. Since commercial, residential, and government spaces have different pricing and needs, this column could be capturing that information.
data_2013_2['Service Address '].nunique(), data_2013_2['Meter No'].nunique(), data_2013_2['ESID'].nunique()
(5725, 4035, 5783)
The next 3 columns are: Bill Date, Read Date and Due Date. Of these it would be best to choose the Bill date across all the data files to keep the data consistent.
data_2013_2[['Meter Read', 'Billed Demand (KVA)', 'kWh Usage']].describe()
|   | Meter Read | Billed Demand (KVA) | kWh Usage |
|---|---|---|---|
| count | 65790.000000 | 65790.000000 | 6.580500e+04 |
| mean | 9743.299217 | 45.011893 | 1.926132e+04 |
| std | 17901.894291 | 382.634210 | 2.047392e+05 |
| min | 0.000000 | 0.000000 | 0.000000e+00 |
| 25% | 0.000000 | 0.000000 | 1.000000e+00 |
| 50% | 3004.500000 | 0.000000 | 2.410000e+02 |
| 75% | 8669.000000 | 9.000000 | 1.789000e+03 |
| max | 239800.000000 | 16775.903614 | 9.689658e+06 |
There are 3 columns that denote the amount of electricity: Meter Read, Billed Demand, kWh Usage.
Using kWh Usage as a standard unit of measurement.
data_2013_2[[
'Base Cost ($)', 'T&D Discretionary ($)', 'T&D Charges ($)',
'Current Due ($)', 'Total Due ($)', 'Franchise Fee ($)',
'Nodal Cu Charge ($)', 'Reliability Unit Charge ($)'
]].describe()
|   | Base Cost ($) | T&D Discretionary ($) | T&D Charges ($) | Current Due ($) | Total Due ($) | Franchise Fee ($) | Nodal Cu Charge ($) | Reliability Unit Charge ($) |
|---|---|---|---|---|---|---|---|---|
| count | 65806.000000 | 65806.000000 | 65806.000000 | 65806.000000 | 65806.000000 | 65806.000000 | 65802.000000 | 65790.0 |
| mean | 1286.967789 | 374.537243 | 283.605427 | 1950.938646 | 1951.890851 | -34.636836 | 5.695807 | 0.0 |
| std | 13665.877497 | 11888.489071 | 1995.337024 | 21571.815431 | 21572.522026 | 241.014300 | 132.109391 | 0.0 |
| min | 0.000000 | -7091.410000 | -37666.730000 | -44264.860000 | 0.000000 | -7017.800000 | -323.080000 | 0.0 |
| 25% | 0.070000 | 3.100000 | 6.020000 | 10.830000 | 11.020000 | -5.520000 | 0.000000 | 0.0 |
| 50% | 16.070000 | 5.800000 | 11.140000 | 32.770000 | 32.650000 | -0.460000 | 0.000000 | 0.0 |
| 75% | 119.257500 | 21.527500 | 81.792500 | 250.297500 | 247.592500 | 0.000000 | 0.140000 | 0.0 |
| max | 650951.220000 | 756478.120000 | 69826.360000 | 907001.560000 | 907001.560000 | 84.910000 | 20461.930000 | 0.0 |
Reliability Unit Charge ($) does not contain any useful information. Electing to drop that column.
The other charge columns are components that add up to the amounts in Current Due and Total Due. Based on the above statistics, Current Due ($) and Total Due ($) represent essentially the same value, so going forward we use Total Due ($).
Based on the above analysis of the dataset choosing the following columns:
data_2013_2 = data_2013_2[[
'ESID', 'Business Area', 'Service Address ', 'Bill Type',
'Bill Date', 'Total Due ($)', 'kWh Usage'
]]
rename_cols = {
'ESID': 'esid',
'Business Area': 'business_area',
'Service Address ': 'service_address',
'Bill Type': 'bill_type',
'Bill Date': 'bill_date',
'Total Due ($)': 'total_due',
'kWh Usage': 'kwh_usage'
}
data_2013_2_main = data_2013_2.rename(columns=rename_cols)
Checking for Nulls again and dtypes
data_2013_2_main.isna().sum()
esid 0 business_area 0 service_address 0 bill_type 0 bill_date 0 total_due 0 kwh_usage 1 dtype: int64
data_2013_2_main.dropna(subset=['kwh_usage'], inplace=True)
data_2013_2_main.isna().sum()
esid 0 business_area 0 service_address 0 bill_type 0 bill_date 0 total_due 0 kwh_usage 0 dtype: int64
data_2013_2_main.dtypes
esid object business_area int64 service_address object bill_type object bill_date datetime64[ns] total_due float64 kwh_usage float64 dtype: object
data_2013_2_main.shape
(65805, 7)
zscore_2013_2 = stats.zscore(data_2013_2_main[['total_due', 'kwh_usage']])
zscore_2013_2
|   | total_due | kwh_usage |
|---|---|---|
| 0 | 3.585927 | 4.777556 |
| 1 | 3.097755 | 4.054406 |
| 2 | 3.299719 | 4.359162 |
| 3 | 3.480909 | 4.642764 |
| 4 | 3.327772 | 4.438234 |
| ... | ... | ... |
| 65801 | -0.058835 | -0.081652 |
| 65802 | -0.059092 | -0.081652 |
| 65803 | -0.058462 | -0.081652 |
| 65804 | -0.058694 | -0.081652 |
| 65805 | -0.058593 | -0.081652 |
65805 rows × 2 columns
Each z-score signifies how many standard deviations an individual value is from the mean, which makes it a good indicator for finding outliers in the dataframe.
Usually a z-score of 3 is taken as the cut-off, so any value with a z-score greater than +3 or less than -3 is considered an outlier; this is essentially the same as the standard-deviation method.
# data_2013_2_main = data_2013_2_main[(np.abs(zscore_2013_2) < 3).all(axis=1)]
data_2013_2_main.shape
(65805, 7)
Since the filtering step above is left commented out, the shape is unchanged at 65,805 rows; applying the z-score filter would remove 417 outlier rows and leave 65,388.
data_2013_2_main.head(5)
|   | esid | business_area | service_address | bill_type | bill_date | total_due | kwh_usage |
|---|---|---|---|---|---|---|---|
| 0 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2013-04-29 | 79309.40 | 997407.0 |
| 1 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2013-03-28 | 68778.30 | 849351.0 |
| 2 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2013-02-26 | 73135.18 | 911746.0 |
| 3 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2013-01-28 | 77043.90 | 969810.0 |
| 4 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2012-12-28 | 73740.34 | 927935.0 |
orig_shape_2013_2 - data_2013_2_main.shape[0]
1
data_2013_2_main.to_csv('electricity_usage_data_2013_2.csv', index=False)
data_2014 = pd.read_excel(
'houston-houston-electricity-bills/coh-fy2014-ee-bills-july2013-june2014.xlsx'
)
orig_shape_2014 = data_2014.shape[0]
data_2014.shape
(67838, 24)
data_2014.head(5)
|   | Reliant Contract No | Service Address | Meter No | ESID | Business Area | Cost Center | Fund | Bill Type | Bill Date | Read Date | Due Date | Meter Read | Base Cost ($) | T&D Discretionary ($) | T&D Charges ($) | Current Due ($) | Index Charge ($) | Total Due ($) | Franchise Fee ($) | Voucher Date | Billed Demand (KVA) | kWh Usage | Nodal Cu Charge ($) | Adder Charge ($) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2014-05-28 | 2014-05-21 | 2014-06-27 | 64981.0 | 36098.63 | 2600.53 | 10078.92 | 50249.89 | 0.0 | 50249.89 | -1014.83 | 2014-05-29 | 1470.526316 | 818192 | 479.19 | 0.0 |
| 1 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2014-04-28 | 2014-04-22 | 2014-05-28 | 64299.0 | 38619.91 | 2576.95 | 9978.70 | 52378.23 | 0.0 | 52378.23 | -1005.86 | 2014-04-29 | 1442.268041 | 875338 | 161.23 | 0.0 |
| 2 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2014-03-27 | 2014-03-23 | 2014-04-26 | 63569.0 | 36275.51 | 2538.54 | 9896.25 | 49788.85 | 0.0 | 49788.85 | -981.00 | 2014-03-28 | 1425.000000 | 822201 | 93.04 | 0.0 |
| 3 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2014-02-27 | 2014-02-23 | 2014-03-29 | 62884.0 | 43283.31 | 2207.88 | 9373.41 | 56056.70 | 0.0 | 56056.70 | -1053.49 | 2014-02-28 | 1513.265306 | 981036 | 77.36 | 0.0 |
| 4 | 2059605 | 10518 BELLAIRE | 303261 | 1008901000140050014100 | 2000 | 2000040005 | 8300 | T | 2014-01-27 | 2014-01-21 | 2014-02-26 | 62066.0 | 37987.10 | 1992.16 | 8762.55 | 49748.12 | 0.0 | 49748.12 | -950.63 | 2014-01-30 | 1373.195876 | 860995 | 17.73 | 0.0 |
data_2014.isna().sum()
Reliant Contract No 0 Service Address 0 Meter No 17633 ESID 0 Business Area 0 Cost Center 0 Fund 0 Bill Type 0 Bill Date 0 Read Date 0 Due Date 0 Meter Read 16 Base Cost ($) 6 T&D Discretionary ($) 0 T&D Charges ($) 0 Current Due ($) 0 Index Charge ($) 3 Total Due ($) 0 Franchise Fee ($) 0 Voucher Date 0 Billed Demand (KVA) 16 kWh Usage 0 Nodal Cu Charge ($) 7 Adder Charge ($) 3 dtype: int64
data_2014['Index Charge ($)'].value_counts(dropna=False)
0.00 67320
NaN 3
8.23 3
8.04 2
7.36 2
...
13587.81 1
10639.12 1
9521.45 1
93.97 1
70.22 1
Name: Index Charge ($), Length: 511, dtype: int64
The column does contain information about a specific charge. Since we ultimately use the total due amount, Index Charge ($) does not need to be kept separately, as it is already included in the total due.
data_2014.drop(columns=['Index Charge ($)'], inplace=True)
There are quite a few columns in the dataset that identify a unique person/house/business. Checking the unique counts of such columns.
check_unique_columns = [
'Reliant Contract No', 'Service Address ', 'Meter No',
'ESID', 'Business Area', 'Cost Center'
]
for col in check_unique_columns:
print(f'Number of Unique Values in {col}: {data_2014[col].nunique()}')
Number of Unique Values in Reliant Contract No: 5888 Number of Unique Values in Service Address : 5824 Number of Unique Values in Meter No: 4026 Number of Unique Values in ESID: 5885 Number of Unique Values in Business Area: 8 Number of Unique Values in Cost Center: 34
NOTE: Compared to previous years, there is one fewer business area.
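A quick way to see which area dropped out (a sketch; data_2013 still carries the original 'Business Area' column from the FY 2013 file loaded earlier):
# Sketch: Business Area codes present in FY 2013 but missing from FY 2014.
print(set(data_2013['Business Area'].unique()) - set(data_2014['Business Area'].unique()))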
Based on the above reported values and further research online:
ESID signifies a unique ID assigned to each customer subscribed to the electricity board. It is best to keep the ESID and Service Address columns going forward, as these give the number of unique customers and the areas (streets) where electricity usage is higher.
Business Area signifies a grouping of buildings that covers a certain area. This is useful for analyzing usage patterns grouped by zones within the city.
data_2014['Bill Type'].value_counts(dropna=False)
T 67340 P 498 Name: Bill Type, dtype: int64
Bill Type could signify the type of connection provided. Since commercial, residential, and government spaces have different pricing and needs, this column could be capturing that information.
Previously there were three bill types (T, P, and C), but in FY 2014 there are only two.
(
data_2014['Service Address '].nunique(),
data_2014['Meter No'].nunique(),
data_2014['ESID'].nunique()
)
(5824, 4026, 5885)
The next 3 columns are: Bill Date, Read Date and Due Date. Of these it would be best to choose the Bill date across all the data files to keep the data consistent.
data_2014[['Meter Read', 'Billed Demand (KVA)', 'kWh Usage']].describe()
|   | Meter Read | Billed Demand (KVA) | kWh Usage |
|---|---|---|---|
| count | 67822.000000 | 67822.000000 | 6.783800e+04 |
| mean | 11922.180792 | 41.816244 | 1.757642e+04 |
| std | 19950.210597 | 365.658193 | 1.925858e+05 |
| min | 0.000000 | 0.000000 | 0.000000e+00 |
| 25% | 0.000000 | 0.000000 | 1.000000e+00 |
| 50% | 4507.000000 | 0.000000 | 2.200000e+02 |
| 75% | 12230.750000 | 7.000000 | 1.419750e+03 |
| max | 492196.000000 | 17348.148148 | 9.383361e+06 |
There are 3 columns that denote the amount of electricity: Meter Read, Billed Demand, kWh Usage.
Using kWh Usage as a standard unit of measurement.
data_2014[[
'Base Cost ($)', 'T&D Discretionary ($)', 'T&D Charges ($)',
'Current Due ($)', 'Total Due ($)', 'Franchise Fee ($)',
'Nodal Cu Charge ($)', 'Adder Charge ($)'
]].describe()
|   | Base Cost ($) | T&D Discretionary ($) | T&D Charges ($) | Current Due ($) | Total Due ($) | Franchise Fee ($) | Nodal Cu Charge ($) | Adder Charge ($) |
|---|---|---|---|---|---|---|---|---|
| count | 67832.000000 | 67838.000000 | 67838.000000 | 67838.000000 | 67838.000000 | 67838.000000 | 67831.000000 | 67835.000000 |
| mean | 752.493762 | 346.176442 | 243.271038 | 1429.004427 | 1428.948580 | -30.802815 | 2.847491 | 8.410448 |
| std | 8850.260472 | 11412.058754 | 1761.077978 | 17419.486301 | 17420.904924 | 210.583496 | 42.663730 | 276.832133 |
| min | -1269.920000 | -45.930000 | -1715.790000 | -104.790000 | 0.000000 | -6179.070000 | -0.010000 | 0.000000 |
| 25% | 0.070000 | 3.140000 | 5.750000 | 12.080000 | 12.080000 | -4.830000 | 0.000000 | 0.000000 |
| 50% | 10.020000 | 6.440000 | 10.380000 | 27.330000 | 27.270000 | -0.470000 | 0.030000 | 0.000000 |
| 75% | 59.690000 | 17.800000 | 60.692500 | 164.850000 | 164.212500 | 0.000000 | 0.170000 | 0.000000 |
| max | 586509.580000 | 781792.760000 | 70591.250000 | 906606.680000 | 906606.680000 | 0.000000 | 5531.690000 | 12528.970000 |
Adder Charge ($) does not contain any useful information. Electing to drop that column. Previously this column was Reliability Unit Charge.
The other charge columns are components that add up to the amounts in Current Due and Total Due. Based on the above statistics, Current Due ($) and Total Due ($) represent essentially the same value, so going forward we use Total Due ($).
Based on the above analysis of the dataset choosing the following columns:
data_2014 = data_2014[[
'ESID', 'Business Area', 'Service Address ', 'Bill Type',
'Bill Date', 'Total Due ($)', 'kWh Usage'
]]
rename_cols = {
'ESID': 'esid',
'Business Area': 'business_area',
'Service Address ': 'service_address',
'Bill Type': 'bill_type',
'Bill Date': 'bill_date',
'Total Due ($)': 'total_due',
'kWh Usage': 'kwh_usage'
}
data_2014_main = data_2014.rename(columns=rename_cols)
Checking for Nulls again and dtypes
data_2014_main.isna().sum()
esid 0 business_area 0 service_address 0 bill_type 0 bill_date 0 total_due 0 kwh_usage 0 dtype: int64
data_2014_main.dtypes
esid object business_area int64 service_address object bill_type object bill_date datetime64[ns] total_due float64 kwh_usage int64 dtype: object
data_2014_main.shape
(67838, 7)
zscore_2014 = stats.zscore(data_2014_main[['total_due', 'kwh_usage']])
zscore_2014
|   | total_due | kwh_usage |
|---|---|---|
| 0 | 2.802455 | 4.157220 |
| 1 | 2.924627 | 4.453952 |
| 2 | 2.775990 | 4.178037 |
| 3 | 3.135782 | 5.002792 |
| 4 | 2.773652 | 4.379476 |
| ... | ... | ... |
| 67833 | -0.044563 | -0.078056 |
| 67834 | 0.024122 | -0.078056 |
| 67835 | -0.047622 | -0.078056 |
| 67836 | -0.047384 | -0.078056 |
| 67837 | -0.043133 | -0.078056 |
67838 rows × 2 columns
Each z-score signifies how many standard deviations an individual value is from the mean, which makes it a good indicator for finding outliers in the dataframe.
Usually a z-score of 3 is taken as the cut-off, so any value with a z-score greater than +3 or less than -3 is considered an outlier; this is essentially the same as the standard-deviation method.
# data_2014_main = data_2014_main[(np.abs(zscore_2014) < 3).all(axis=1)]
data_2014_main.shape
(67838, 7)
Since the filtering step above is left commented out, the shape is unchanged at 67,838 rows; applying the z-score filter would remove 411 outlier rows and leave 67,427.
data_2014_main.head(5)
|   | esid | business_area | service_address | bill_type | bill_date | total_due | kwh_usage |
|---|---|---|---|---|---|---|---|
| 0 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2014-05-28 | 50249.89 | 818192 |
| 1 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2014-04-28 | 52378.23 | 875338 |
| 2 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2014-03-27 | 49788.85 | 822201 |
| 3 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2014-02-27 | 56056.70 | 981036 |
| 4 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 2014-01-27 | 49748.12 | 860995 |
orig_shape_2014 - data_2014_main.shape[0]
0
data_2014_main.to_csv('electricity_usage_data_2014.csv', index=False)
df_list = [data_2012_main, data_2013_main, data_2013_2_main, data_2014_main]
data = pd.concat(df_list)
data.shape
(257848, 7)
Since we cleaned each dataframe separately there should not be any NaNs, but checking nonetheless.
data.isna().sum()
esid 0 business_area 0 service_address 0 bill_type 0 bill_date 0 total_due 0 kwh_usage 0 dtype: int64
Since the time periods covered by the files overlap, it is important not to have repeating rows.
We can check for this in the following way: a particular ESID should be billed only once for a given bill date. Therefore, by checking for duplicates on the subset of ESID, Business Area, Service Address, and Bill Date, we can tell whether a particular customer's billing info is repeated in the dataframe.
dup_rows_index = data.duplicated(
subset=['esid', 'business_area', 'service_address', 'bill_date']
)
(dup_rows_index).sum()
66595
This confirms the suspicion that the overlap between the FY 2012/FY 2013 files and the May 2012-April 2013 file has generated duplicate rows in the data. We need to remove these rows.
data_main = data[~(dup_rows_index)]
data_main.shape
(191253, 7)
data_main.to_csv('Electricity_Usage_Data.csv', index=False)
import time
from pprint import pprint
data_main = pd.read_csv('Electricity_Usage_Data.csv')
data_main[['bill_date']] = data_main[['bill_date']].apply(pd.to_datetime)
data_main.loc[:,'bill_date'] = data_main['bill_date'].apply(
lambda x: pd.to_datetime(f'{x.year}-{x.month}-01')
)
C:\Users\swarg\AppData\Local\Temp\ipykernel_5928\4102228820.py:1: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
data_main.loc[:,'bill_date'] = data_main['bill_date'].apply(lambda x: pd.to_datetime(f'{x.year}-{x.month}-01'))
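An equivalent, vectorized way to snap each bill date to the first of its month (a sketch; it avoids the per-row apply and the FutureWarning above):
# Sketch: convert to monthly periods and back to timestamps (first day of the month).
data_main['bill_date'] = data_main['bill_date'].dt.to_period('M').dt.to_timestamp()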
viz_df = data_main.set_index('bill_date')
viz_df.head()
| bill_date | esid | business_area | service_address | bill_type | total_due | kwh_usage |
|---|---|---|---|---|---|---|
| 2012-06-01 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 73232.11 | 905421.0 |
| 2012-05-01 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 68463.46 | 824107.0 |
| 2012-04-01 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 80847.87 | 977744.0 |
| 2012-03-01 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 74373.93 | 876838.0 |
| 2012-02-01 | 1008901000140050014100 | 2000 | 10518 BELLAIRE | T | 74080.27 | 872898.0 |
address_enc = LabelEncoder()
bill_type_enc = LabelEncoder()
data_main['address_enc'] = address_enc.fit_transform(
data_main['service_address']
)
data_main['bill_type_enc'] = bill_type_enc.fit_transform(
data_main['bill_type']
)
data_main['year'] = data_main['bill_date'].apply(lambda x:x.year)
data_main['month'] = data_main['bill_date'].apply(lambda x:x.month)
sns.heatmap(data_main.corr())
C:\Users\swarg\AppData\Local\Temp\ipykernel_5928\942240332.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. sns.heatmap(data_main.corr())
<Axes: >
There do not seem to be any features that have high correlation with kwh usage except total due. But this is to be expected since the amount of energy used is directly proportional to the cost.
It might be difficult to use the features as they are for ML modeling.
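For a numeric view of the same heatmap (a sketch; numeric_only is passed explicitly here to avoid the deprecation warning shown above):
# Sketch: correlation of each numeric feature with kwh_usage, sorted.
print(data_main.corr(numeric_only=True)['kwh_usage'].sort_values(ascending=False))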
# Pie chart for 'Bill Type'
explode = (0, 0.4, 0.2)
plt.figure(figsize=(5, 4))
print('value_counts:\n', data_main['bill_type'].value_counts())
data_main['bill_type'].value_counts().plot(kind='pie', explode=explode) #, autopct='%1.10f%%')
plt.title('Bill Type Pie Chart')
plt.xlabel('Type of Bill')
plt.ylabel('Frequency')
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
plt.tight_layout()
plt.show()
value_counts: T 187663 P 1297 C 11 Name: bill_type, dtype: int64
# Bar chart for 'Business Area'
plt.figure(figsize=(8, 6))
ax = data_main['business_area'].value_counts().plot(kind='bar')
plt.title('business_area')
plt.xlabel('Area')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
Business area 2000 is the most common area based on the frequency plot, and the most common bill type is T.
viz_df_2 = data_main.groupby('bill_date').agg(
{'kwh_usage':'mean'}
).reset_index()
temp_list = [0 for _ in range(12)]
date_dict = {
'2011': temp_list.copy(),
'2012': temp_list.copy(),
'2013': temp_list.copy(),
'2014': temp_list.copy(),
}
for date, usage in viz_df_2.values:
date_dict[str(date.year)][date.month-1] = usage
usage_2011 = date_dict['2011']
usage_2012 = date_dict['2012']
usage_2013 = date_dict['2013']
usage_2014 = date_dict['2014']
plt.figure(figsize=(8,6))
plt.plot(range(12), usage_2011, label='2011')
plt.plot(range(12), usage_2012, label='2012')
plt.plot(range(12), usage_2013, label='2013')
plt.plot(range(12), usage_2014, label='2014')
plt.xlabel('Month')
plt.ylabel('kWh Usage')
plt.title('Average Monthly Usage of Electricity')
plt.legend()
plt.show()
In 2011 the first six months, and in 2014 the last six months, have the value 0. This is because the data was collected per fiscal year, from July of one year to June of the next.
Similar to the previous plot, we can see the same trend. One thing to note is that the monthly pattern has remained relatively unchanged across these four years.
The code for the visualizations performed in this section is in the IPYNB: 'Viz_Sourabh.ipynb'
def plotbox(df, column):
    # Aggregate the datetime-indexed frame into 60-minute bins, then extract the
    # requested date attribute (e.g. 'month' or 'year') from the index for grouping.
    plot_features = df.groupby(pd.Grouper(freq='60T')).mean().copy()
    plot_features[column] = [getattr(x, column) for x in plot_features.index]
    plot_features.boxplot('kwh_usage', by=column, figsize=(12, 8), grid=False)
    plt.ylabel('kWh Usage')
    plt.xlabel(column)
    plt.show()
plotbox(viz_df, 'month')
Based on the above box plot, we can see that the highest energy is consumed in the months of June-September. That is in the summer/fall season energy usage is quite high.
plotbox(viz_df, 'year')
It was expected that energy consumption would increase through the years. It was surprising to see that the energy use in fact decreased. Further analysis needs to be done in order to see why this might have happened.
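One quick follow-up along those lines (a sketch, not a conclusion): check whether the number of distinct ESIDs billed per year changed, since fewer billed accounts would also pull the average down.
# Sketch: distinct ESIDs billed per year.
print(data_main.groupby('year')['esid'].nunique())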
data_main_agg = data_main.groupby('business_area').agg({
'esid': pd.Series.nunique,
'total_due': 'mean',
'kwh_usage': 'mean',
})
data_main_agg.sort_values(by='kwh_usage', ascending=False)
| business_area | esid | total_due | kwh_usage |
|---|---|---|---|
| 1600 | 2 | 20686.296296 | 267252.759259 |
| 4200 | 32 | 12314.233342 | 139198.855389 |
| 2800 | 202 | 7763.883129 | 99229.628037 |
| 6500 | 10 | 2069.572381 | 24864.882784 |
| 2500 | 574 | 1762.559758 | 21484.003090 |
| 2000 | 4759 | 1685.935398 | 16412.006549 |
| 2100 | 23 | 1153.184006 | 13450.079826 |
| 3600 | 449 | 856.760282 | 9159.579703 |
| 6800 | 39 | 308.368067 | 3174.383499 |
plt.plot(
data_main_agg.index,
data_main_agg['total_due'],
color='b',
label='Average Cost'
)
plt.plot(
data_main_agg.index,
data_main_agg['kwh_usage'],
color='r',
label='Average kwh Usage'
)
plt.title('Average Energy Usage by Business Area')
plt.legend()
plt.show()
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(9, 4))
fig.suptitle('Average Energy Usage by Business Area')
axes[0].plot(
data_main_agg.index,
data_main_agg['total_due'],
color='b',
label='Average Cost'
)
axes[0].legend()
axes[1].plot(
data_main_agg.index,
data_main_agg['kwh_usage'],
color='r',
label='Average kwh Usage'
)
axes[1].legend()
fig.tight_layout()
plt.show()
This indicates that both the cost and the kWh usage measure the same underlying quantity (electricity usage); the only difference between them is the scale (a multiplicative factor).
Therefore, analyzing just the kWh usage or just the price is enough, and the two should not be used together in predictive tasks; i.e., if we are predicting kWh usage then price should not be present in the training data, as they are different representations of the same thing (usage).
monthly_en = viz_df.resample('M', label = 'left')['kwh_usage'].max()
plt.figure(figsize = (12,6))
#plotting the max monthly energy consumption
plt.plot(monthly_en)
plt.xlim(monthly_en.index.min(), monthly_en.index.max())
locator = mdates.MonthLocator(bymonthday = 1, interval = 2)
fmt = mdates.DateFormatter('%m-%y')
X = plt.gca().xaxis
# Setting the locator
X.set_major_locator(locator)
# Specify formatter
X.set_major_formatter(fmt)
plt.xticks(rotation = 45)
plt.ylabel('Max Energy consumption in kWh')
plt.xlabel('Date')
plt.title('Peak energy usage of a month over the years 2011-14')
plt.show()
There is a noticeable trend in the usage of energy across the months from 2011 to 2014.
sns.pairplot(data_main)
<seaborn.axisgrid.PairGrid at 0x28f1556fc40>
# Stacked bar chart to visualize the distribution of bill types across different business areas
business_areas = data_main['business_area'].unique()
if len(business_areas) > 0:
bill_types = data_main['bill_type'].unique()
business_area_counts = []
for ba in business_areas:
counts = data_main.loc[data_main['business_area'] == ba]['bill_type'].value_counts()
business_area_counts.append(counts)
bill_type_counts_by_business_area = pd.concat(business_area_counts, axis=1, keys=business_areas)
print('bill_type_counts_by_business_area:')
bill_type_counts_by_business_area.plot(kind='bar', stacked=True)
# y-axis label
plt.ylabel('Number of Bills')
# Chart title
plt.title('Distribution of Bill Types by Business Area')
plt.show()
bill_type_counts_by_business_area:
data_main = pd.read_csv('Electricity_Usage_Data.csv')
data_main[['bill_date']] = data_main[['bill_date']].apply(pd.to_datetime)
data_main.loc[:,'bill_date'] = data_main['bill_date'].apply(
lambda x: pd.to_datetime(f'{x.year}-{x.month}-01')
)
address_enc = LabelEncoder()
bill_type_enc = LabelEncoder()
data_main['address_enc'] = address_enc.fit_transform(
data_main['service_address']
)
data_main['bill_type_enc'] = bill_type_enc.fit_transform(
data_main['bill_type']
)
data_main['year'] = data_main['bill_date'].apply(lambda x:x.year)
data_main['month'] = data_main['bill_date'].apply(lambda x:x.month)
Models Proposed:
The code for the ML/Stats performed in this section is in the IPYNB: 'Regression_Sourabh.ipynb'
data_main.head()
Q1 = data_main['kwh_usage'].quantile(0.25)
Q3 = data_main['kwh_usage'].quantile(0.75)
IQR = Q3 - Q1
Q1, Q3, IQR
data_main_filt = data_main[~(
(data_main['kwh_usage'] < (Q1 - 1.5 * IQR)) |
(data_main['kwh_usage'] > (Q3 + 1.5 * IQR))
)]
data_main.shape, data_main_filt.shape
((191253, 11), (157498, 11))
Therefore, 33,755 rows have values that are considered outliers based on the IQR Method.
def regression_metrics(model, X_train, X_test, y_train, y_test):
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
print(f'Train R2 Score: {r2_score(y_train, y_train_pred)}')
print(f'Test R2 Score: {r2_score(y_test, y_test_pred)}')
print(f'Train MSE Score: {mean_squared_error(y_train, y_train_pred)}')
print(f'Test MSE Score: {mean_squared_error(y_test, y_test_pred)}')
X = data_main[[
'business_area', 'address_enc', 'bill_type_enc', 'year', 'month'
]]
y = data_main[['kwh_usage']]
X_filt = data_main_filt[[
'business_area', 'address_enc', 'bill_type_enc', 'year', 'month'
]]
y_filt = data_main_filt[['kwh_usage']]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
X_train_filt, X_test_filt, y_train_filt, y_test_filt = train_test_split(
X_filt, y_filt, test_size=0.2, random_state=42
)
reg = LinearRegression().fit(X_train, y_train)
reg_filt = LinearRegression().fit(X_train_filt, y_train_filt)
print('With Outliers')
regression_metrics(reg, X_train, X_test, y_train, y_test)
print('='*50)
print('Without Outliers')
regression_metrics(
reg_filt, X_train_filt, X_test_filt, y_train_filt, y_test_filt
)
With Outliers Train R2 Score: 0.009173833177215318 Test R2 Score: 0.007167343822345074 Train MSE Score: 39877725926.044525 Test MSE Score: 49616557595.280334 ================================================== Without Outliers Train R2 Score: 0.08955528262685242 Test R2 Score: 0.09509437141206156 Train MSE Score: 672051.7504459427 Test MSE Score: 694695.2188051218
gbr = GradientBoostingRegressor().fit(X_train, np.ravel(y_train))
gbr_filt = GradientBoostingRegressor().fit(
X_train_filt, np.ravel(y_train_filt)
)
print('With Outliers')
regression_metrics(
gbr, X_train, X_test, np.ravel(y_train), np.ravel(y_test)
)
print('='*50)
print('Without Outliers')
regression_metrics(
gbr_filt, X_train_filt, X_test_filt, np.ravel(y_train_filt), np.ravel(y_test_filt)
)
With Outliers Train R2 Score: 0.7043295722890507 Test R2 Score: 0.7315554088149221 Train MSE Score: 11899831348.320082 Test MSE Score: 13415449659.919912 ================================================== Without Outliers Train R2 Score: 0.1523656445581465 Test R2 Score: 0.15408797851650213 Train MSE Score: 625687.800085661 Test MSE Score: 649405.8808887758
dtr = DecisionTreeRegressor().fit(X_train, np.ravel(y_train))
dtr_filt = DecisionTreeRegressor().fit(X_train_filt, np.ravel(y_train_filt))
print('With Outliers')
regression_metrics(
dtr, X_train, X_test, np.ravel(y_train), np.ravel(y_test)
)
print('='*50)
print('Without Outliers')
regression_metrics(
dtr_filt, X_train_filt, X_test_filt, np.ravel(y_train_filt), np.ravel(y_test_filt)
)
With Outliers Train R2 Score: 0.9328533753005721 Test R2 Score: 0.9252101010668637 Train MSE Score: 2702446489.891373 Test MSE Score: 3737606035.4899774 ================================================== Without Outliers Train R2 Score: 0.9934125253356817 Test R2 Score: 0.8822752077931225 Train MSE Score: 4862.594943652055 Test MSE Score: 90377.21470310935
Models proposed:
Logistic Regression - a widely used, interpretable model that can be used to set a baseline accuracy. It assumes a linear relationship between the variables, so it might give poor results.
Decision Tree Classifier - handles non-linear relationships between the input and target variables well, but can be prone to overfitting on the training data.
Random Forest Classifier - an ensemble model that combines multiple decision trees into a more powerful model, but it is harder to interpret and requires more computational resources to run.
def classification_metrics(model, X_train, X_test, y_train, y_test):
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
print(
f'Train F1 Score: {f1_score(y_train, y_train_pred, average="macro")}'
)
print(
f'Test F1 Score: {f1_score(y_test, y_test_pred, average="macro")}'
)
print(f'Train Accuracy Score: {accuracy_score(y_train, y_train_pred)}')
print(f'Test Accuracy Score: {accuracy_score(y_test, y_test_pred)}')
X = data_main[[
'business_area', 'address_enc', 'kwh_usage', 'year', 'month'
]]
y = data_main[['bill_type_enc']]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
lreg = LogisticRegression().fit(X_train, np.ravel(y_train))
C:\Users\sbelde3\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
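The convergence warning above can usually be addressed by scaling the inputs or raising max_iter, as the message suggests. A minimal sketch of that variant (hypothetical; the metrics reported below are for the unscaled fit above):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Standardize the features so lbfgs converges more easily, and allow more iterations
lreg_scaled = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000)
).fit(X_train, np.ravel(y_train))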
classification_metrics(lreg, X_train, X_test, y_train, y_test)
Train F1 Score: 0.44078631425979947
Test F1 Score: 0.43995857012872036
Train Accuracy Score: 0.9931112011607691
Test Accuracy Score: 0.9935426524796738
dtc = DecisionTreeClassifier().fit(X_train, y_train)
classification_metrics(dtc, X_train, X_test, y_train, y_test)
Train F1 Score: 0.9998707454630322
Test F1 Score: 0.6666403148167668
Train Accuracy Score: 0.9999934641377237
Test Accuracy Score: 0.9998431413557816
rfc = RandomForestClassifier().fit(X_train, np.ravel(y_train))
classification_metrics(
    rfc, X_train, X_test, np.ravel(y_train), np.ravel(y_test)
)
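For both the logistic regression and the decision tree, accuracy (~0.99) is far higher than macro F1 (roughly 0.44 and 0.67), which points to heavily imbalanced bill_type classes: the dominant class inflates accuracy while the rare classes drag macro F1 down. A quick diagnostic sketch (not part of the original run) to confirm this:
from sklearn.metrics import confusion_matrix
# How skewed are the bill types? A single dominant class inflates accuracy
print(data_main['bill_type_enc'].value_counts(normalize=True))
# Per-class breakdown of the decision tree's test predictions
print(confusion_matrix(y_test, dtc.predict(X_test)))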
Proposed Models:
Time-series analysis is still in progress at this checkpoint; the models explored below are ARIMA, VAR, and LSTM.
# Mount drive to colab file
from google.colab import drive
drive.mount('/content/drive')
# Insert, change the directory
import sys
sys.path.insert(0,'/content/drive/MyDrive/CS418-Project-main')
%cd /content/drive/MyDrive/CS418-Project-main
Mounted at /content/drive /content/drive/MyDrive/CS418-Project-main
# !pip install pmdarima
Successfully installed pmdarima-2.0.3
Load all the data from the files, which were cleaned and pre-processed by the team members.
data_2012_main = pd.read_csv('electricity_usage_data_2012.csv')
data_2013_main = pd.read_csv('electricity_usage_data_2013.csv')
data_2013_2_main = pd.read_csv('electricity_usage_data_2013_2.csv')
data_2014_main = pd.read_csv('electricity_usage_data_2014.csv')
# Remove outliers: drop rows whose total_due or kwh_usage z-score magnitude is 3 or more
def remove_outliers(df, name):
    zscores = stats.zscore(df[['total_due', 'kwh_usage']])
    print('{} shape before removing outliers: {}'.format(name, df.shape))
    df = df[(np.abs(zscores) < 3).all(axis=1)]
    print('{} shape after removing outliers: {}'.format(name, df.shape), '\n')
    return df

data_2012_main = remove_outliers(data_2012_main, 'data_2012_main')
data_2013_main = remove_outliers(data_2013_main, 'data_2013_main')
data_2013_2_main = remove_outliers(data_2013_2_main, 'data_2013_2_main')
data_2014_main = remove_outliers(data_2014_main, 'data_2014_main')
data_2012_main shape before removing outliers: (57025, 7)
data_2012_main shape after removing outliers: (55974, 7)

data_2013_main shape before removing outliers: (66360, 7)
data_2013_main shape after removing outliers: (65306, 7)

data_2013_2_main shape before removing outliers: (65805, 7)
data_2013_2_main shape after removing outliers: (65388, 7)

data_2014_main shape before removing outliers: (67838, 7)
data_2014_main shape after removing outliers: (67427, 7)
Verify the data for nulls and duplicate rows, and save the final data into a CSV file.
df_list = [data_2012_main, data_2013_main, data_2013_2_main, data_2014_main]
data = pd.concat(df_list)
print('data.shape', data.shape, '\n')
# Checking nulls in the data
print('Nulls in the data:\n', data.isna().sum(), '\n')
# Checking for duplicate rows
dup_rows_index = data.duplicated(subset=['esid', 'business_area', 'service_address', 'bill_date'])
print('duplicate rows', (dup_rows_index).sum(), '\n')
# Removing the duplicates
data_main = data[~(dup_rows_index)]
print('data_main.shape', data_main.shape, '\n')
# shape from an earlier run: data_main.shape (190848, 7)
# saving into csv files
data_main.to_csv('Electricity_Usage_Data.csv', index=False)
data.shape (254095, 7)

Nulls in the data:
esid               0
business_area      0
service_address    0
bill_type          0
bill_date          0
total_due          0
kwh_usage          0
dtype: int64

duplicate rows 65124

data_main.shape (188971, 7)
plt.rcParams.update({'font.size': 12})
data_df = pd.read_csv('Electricity_Usage_Data.csv')
address_enc = LabelEncoder()
bill_type_enc = LabelEncoder()
data_df['bill_date']=pd.to_datetime(data_df['bill_date'])
data_df['year'] = data_df['bill_date'].apply(lambda x: x.year)
data_df['month'] = data_df['bill_date'].apply(lambda x: x.month)
data_df['year_month'] = data_df['bill_date'].dt.date.apply(lambda x: x.strftime('%Y-%m'))
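# Continuous week index: ISO week of the bill date plus 52 per elapsed year since 2011 (e.g., 2012-02-28 is ISO week 9, so 9 + 52 = 61)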
data_df['week'] = data_df.apply(lambda row: row['bill_date'].week+52*(int(row['year'])-2011),axis=1)
data_df.head()
| esid | business_area | service_address | bill_type | bill_date | total_due | kwh_usage | year | month | year_month | week | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1008901000140220013100 | 2500 | 17000 ALDINE WFLD | T | 2012-02-28 | 10612.81 | 100461.0 | 2012 | 2 | 2012-02 | 61 |
| 1 | 1008901000140220013100 | 2500 | 17000 ALDINE WFLD | T | 2012-01-31 | 11689.34 | 117843.0 | 2012 | 1 | 2012-01 | 57 |
| 2 | 1008901000140220013100 | 2500 | 17000 ALDINE WFLD | T | 2011-12-30 | 11173.67 | 115471.0 | 2011 | 12 | 2011-12 | 52 |
| 3 | 1008901000141370015100 | 2000 | 449 W 19TH | T | 2012-06-18 | 6457.79 | 79984.0 | 2012 | 6 | 2012-06 | 77 |
| 4 | 1008901000141370015100 | 2000 | 449 W 19TH | T | 2012-05-15 | 5809.91 | 69067.0 | 2012 | 5 | 2012-05 | 72 |
df = data_df[['kwh_usage', 'week']]
df = df.groupby(by=['week']).mean()
plt.figure(figsize=(35, 7))
plt.grid()
plt.plot(df)
plt.title('Average Weekly Consumption of Electricity', fontsize=25)
plt.xlabel('Week', fontsize=20)
plt.ylabel('Kwh Usage', fontsize=20)
plt.tight_layout()
A stationary time series is one whose statistical properties (mean, variance, etc.) do not change over time.
We need to perform additional checks to determine whether the series is stationary.
We'll use rolling statistics first, followed by the augmented Dickey-Fuller test, to check whether the series is stationary, and make it stationary if it is not.
rolling_mean = df.rolling(2).mean()
rolling_std = df.rolling(2).std()
plt.figure(figsize=(35, 8))
plt.grid()
plt.plot(df, color="blue",label="Original Usage")
plt.plot(rolling_mean, color="red", label="Rolling Mean")
plt.plot(rolling_std, color="black", label = "Rolling Standard Deviation")
plt.title('Electricity Time Series, Rolling Mean, Standard Deviation', fontsize=25)
plt.xlabel('Week', fontsize=20)
plt.ylabel('Energy Usage in Kwh', fontsize=20)
plt.legend(loc="upper left")
plt.tight_layout()
We see that the statistics are not constant over time, but to confirm this we'll perform an additional statistical test using the augmented Dickey-Fuller method.
H0 (null hypothesis): the series has a unit root, i.e. it is non-stationary.
H1 (alternative hypothesis): the series has no unit root, i.e. it is stationary.
If the p-value < significance level (0.05), we reject the null hypothesis H0.
If the p-value > significance level (0.05), we fail to reject the null hypothesis H0.
def aug_dickey_fuller_test(df):
adft = adfuller(df, autolag="AIC")
output_df = pd.DataFrame({"Values":[adft[0],adft[1],adft[2],adft[3], adft[4]['1%'], adft[4]['5%'], adft[4]['10%']],
"Metric":["Test Statistics","p-value","No. of lags used","Number of observations used", "critical value (1%)", "critical value (5%)", "critical value (10%)"]})
print(output_df)
aug_dickey_fuller_test(df)
Values Metric
0 -1.831026 Test Statistics
1 0.365171 p-value
2 12.000000 No. of lags used
3 144.000000 Number of observations used
4 -3.476598 critical value (1%)
5 -2.881829 critical value (5%)
6 -2.577589 critical value (10%)
Since the test statistic (-1.83) is greater than the 5% critical value (-2.88) and the p-value is greater than 0.05, we fail to reject the null hypothesis; the time series is therefore non-stationary.
# First we'll perform differencing on the data to see if it becomes stationary
diff_df = df.diff()
plt.figure(figsize=(32, 8))
plt.grid()
plt.plot(diff_df)
plt.title('Electricity Usage after Differencing', fontsize=25)
plt.xlabel('Week', fontsize=20)
plt.ylabel('Kwh Usage', fontsize=20)
plt.tight_layout()
# Confirm with the dickey-fuller test
aug_dickey_fuller_test(diff_df.dropna())
Values Metric
0 -4.550132 Test Statistics
1 0.000159 p-value
2 11.000000 No. of lags used
3 144.000000 Number of observations used
4 -3.476598 critical value (1%)
5 -2.881829 critical value (5%)
6 -2.577589 critical value (10%)
Here the test statistic (-4.55) is less than the 5% critical value (-2.88) and the p-value is below 0.05, so we reject the null hypothesis in favour of the alternative: the series becomes stationary after first-order differencing, i.e. d = 1.
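As a cross-check on the (p, q) orders that auto_arima selects below, the ACF and PACF of the first-differenced series can be inspected directly with the plot_acf/plot_pacf functions already imported above; a minimal sketch (lags=30 is an arbitrary choice):
# ACF tail-off/cut-off hints at the MA order q; PACF cut-off hints at the AR order p
fig, axes = plt.subplots(1, 2, figsize=(16, 4))
plot_acf(diff_df['kwh_usage'].dropna(), lags=30, ax=axes[0])
plot_pacf(diff_df['kwh_usage'].dropna(), lags=30, ax=axes[1])
plt.tight_layout()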
# week wise data split
train_data = df.loc[:160]
test_data = df.loc[160:]
plt.figure(figsize=(35, 7))
plt.grid()
plt.plot(train_data, c='blue', label='Train kwh_usage')
plt.plot(test_data, c='orange', label='Test kwh_usage')
plt.legend(loc='upper left', prop={'size':20})
plt.title('Average Weekly Consumption of Electricity for Train, Test data', fontsize=25)
plt.xlabel('Week', fontsize=20)
plt.ylabel('Kwh Usage', fontsize=20)
plt.tight_layout()
# Find the order of the ARIMA model
order_df = auto_arima(df, trace=True, suppress_warnings=True)
order_df.summary()
Performing stepwise search to minimize aic
 ARIMA(2,1,2)(0,0,0)[0] intercept : AIC=2845.007, Time=0.31 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept : AIC=2947.089, Time=0.07 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept : AIC=2939.772, Time=0.05 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept : AIC=2885.511, Time=0.24 sec
 ARIMA(0,1,0)(0,0,0)[0]           : AIC=2945.228, Time=0.25 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept : AIC=2884.697, Time=1.06 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept : AIC=2861.588, Time=0.46 sec
 ARIMA(3,1,2)(0,0,0)[0] intercept : AIC=2807.713, Time=0.27 sec
 ARIMA(3,1,1)(0,0,0)[0] intercept : AIC=2818.737, Time=0.16 sec
 ARIMA(4,1,2)(0,0,0)[0] intercept : AIC=2808.010, Time=1.73 sec
 ARIMA(3,1,3)(0,0,0)[0] intercept : AIC=2802.027, Time=1.24 sec
 ARIMA(2,1,3)(0,0,0)[0] intercept : AIC=2814.505, Time=1.01 sec
 ARIMA(4,1,3)(0,0,0)[0] intercept : AIC=2804.960, Time=1.34 sec
 ARIMA(3,1,4)(0,0,0)[0] intercept : AIC=2802.381, Time=0.74 sec
 ARIMA(2,1,4)(0,0,0)[0] intercept : AIC=2809.810, Time=0.47 sec
 ARIMA(4,1,4)(0,0,0)[0] intercept : AIC=2805.810, Time=0.88 sec
 ARIMA(3,1,3)(0,0,0)[0]           : AIC=2800.870, Time=0.41 sec
 ARIMA(2,1,3)(0,0,0)[0]           : AIC=2813.233, Time=0.47 sec
 ARIMA(3,1,2)(0,0,0)[0]           : AIC=2806.267, Time=0.25 sec
 ARIMA(4,1,3)(0,0,0)[0]           : AIC=2802.849, Time=0.64 sec
 ARIMA(3,1,4)(0,0,0)[0]           : AIC=2801.509, Time=0.64 sec
 ARIMA(2,1,2)(0,0,0)[0]           : AIC=2838.458, Time=0.25 sec
 ARIMA(2,1,4)(0,0,0)[0]           : AIC=2808.483, Time=0.41 sec
 ARIMA(4,1,2)(0,0,0)[0]           : AIC=2820.962, Time=0.41 sec
 ARIMA(4,1,4)(0,0,0)[0]           : AIC=2804.433, Time=0.72 sec

Best model: ARIMA(3,1,3)(0,0,0)[0]
Total fit time: 14.579 seconds
| Dep. Variable: | y | No. Observations: | 157 |
|---|---|---|---|
| Model: | SARIMAX(3, 1, 3) | Log Likelihood | -1393.435 |
| Date: | Thu, 04 May 2023 | AIC | 2800.870 |
| Time: | 00:33:06 | BIC | 2822.219 |
| Sample: | 0 | HQIC | 2809.542 |
| - 157 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| ar.L1 | -0.0398 | 0.118 | -0.337 | 0.736 | -0.271 | 0.191 |
| ar.L2 | -0.9198 | 0.035 | -26.484 | 0.000 | -0.988 | -0.852 |
| ar.L3 | -0.2937 | 0.117 | -2.508 | 0.012 | -0.523 | -0.064 |
| ma.L1 | -0.7610 | 0.113 | -6.758 | 0.000 | -0.982 | -0.540 |
| ma.L2 | 1.0158 | 0.076 | 13.279 | 0.000 | 0.866 | 1.166 |
| ma.L3 | -0.4996 | 0.105 | -4.756 | 0.000 | -0.706 | -0.294 |
| sigma2 | 3.525e+06 | 1.35e-08 | 2.61e+14 | 0.000 | 3.53e+06 | 3.53e+06 |
| Ljung-Box (L1) (Q): | 0.01 | Jarque-Bera (JB): | 40.44 |
|---|---|---|---|
| Prob(Q): | 0.94 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 1.94 | Skew: | 0.89 |
| Prob(H) (two-sided): | 0.02 | Kurtosis: | 4.75 |
# Fit ARIMA with the best (p, d, q) = (3, 1, 3) from the auto_arima search above; d=1 also matches the single differencing needed for stationarity
model = ARIMA(train_data, order=(3, 1, 3)).fit()
# Prediction
pred = model.predict(start=len(train_data)-1,end=(len(df)-1))
# Model Summary
model.summary()
| Dep. Variable: | kwh_usage | No. Observations: | 135 |
|---|---|---|---|
| Model: | ARIMA(3, 1, 3) | Log Likelihood | -1193.626 |
| Date: | Thu, 04 May 2023 | AIC | 2401.252 |
| Time: | 00:57:05 | BIC | 2421.537 |
| Sample: | 0 | HQIC | 2409.495 |
| - 135 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| ar.L1 | 0.0083 | 0.142 | 0.058 | 0.954 | -0.271 | 0.287 |
| ar.L2 | -0.9245 | 0.040 | -23.401 | 0.000 | -1.002 | -0.847 |
| ar.L3 | -0.2610 | 0.137 | -1.899 | 0.058 | -0.530 | 0.008 |
| ma.L1 | -0.7966 | 0.151 | -5.282 | 0.000 | -1.092 | -0.501 |
| ma.L2 | 1.0392 | 0.100 | 10.400 | 0.000 | 0.843 | 1.235 |
| ma.L3 | -0.4557 | 0.126 | -3.607 | 0.000 | -0.703 | -0.208 |
| sigma2 | 3.384e+06 | 2.39e-09 | 1.42e+15 | 0.000 | 3.38e+06 | 3.38e+06 |
| Ljung-Box (L1) (Q): | 0.00 | Jarque-Bera (JB): | 5.02 |
|---|---|---|---|
| Prob(Q): | 0.99 | Prob(JB): | 0.08 |
| Heteroskedasticity (H): | 2.03 | Skew: | 0.46 |
| Prob(H) (two-sided): | 0.02 | Kurtosis: | 3.26 |
# Evaluation
print('Mean Absolute Error: %.2f' % mean_absolute_error(test_data['kwh_usage'].values, pred))
print('Root Mean Squared Error: %.2f' % np.sqrt(mean_squared_error(test_data['kwh_usage'], pred)))
Mean Absolute Error: 1493.04
Root Mean Squared Error: 2260.68
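To put these errors in context relative to the scale of the series (weekly means of a few thousand kWh), a relative-error metric such as MAPE can be added; a small sketch, assuming a scikit-learn version that provides mean_absolute_percentage_error (0.24+), not part of the original evaluation:
from sklearn.metrics import mean_absolute_percentage_error
# Mean absolute percentage error of the ARIMA forecast over the held-out weeks
mape = mean_absolute_percentage_error(test_data['kwh_usage'].values, pred)
print('Mean Absolute Percentage Error: %.2f%%' % (mape * 100))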
pred = pd.Series(list(pred.values), index=list(test_data.index))
plt.figure(figsize=(35, 8))
plt.grid()
plt.plot(train_data, label = 'Train kwh_usage')
plt.plot(test_data, label = 'Test kwh_usage')
plt.plot(pred, label = 'Predicted kwh_usage')
plt.title('Weekly Electricity Usage Forecasting', fontsize=25)
plt.xlabel('Week', fontsize=20)
plt.ylabel('Average Energy Usage Consumption in Kwh', fontsize=20)
plt.legend(loc='upper left', prop={'size': 20})
plt.tight_layout()
# week wise data
var_df = data_df[['week', 'total_due', 'kwh_usage']]
group_df = var_df.groupby(['week']).mean()
fig, axes = plt.subplots(nrows=2, ncols=1, dpi=120, figsize=(20,6))
for i, ax in enumerate(axes.flatten()):
# data = var_df['bill_date', var_df.columns[i]].groupby('bill_date').mean()
data = group_df[group_df.columns[i]]
ax.plot(data, color='blue', linewidth=1)
# Decorations
ax.set_title(group_df.columns[i])
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')
ax.spines["top"].set_alpha(0)
ax.tick_params(labelsize=6)
plt.tight_layout();
print('kwh_usage augmented dickey-fuller test:')
aug_dickey_fuller_test(group_df['kwh_usage'])
print('\n')
print('total_due augmented dickey-fuller test:')
aug_dickey_fuller_test(group_df['total_due'])
kwh_usage augmented dickey-fuller test:
Values Metric
0 -1.831026 Test Statistics
1 0.365171 p-value
2 12.000000 No. of lags used
3 144.000000 Number of observations used
4 -3.476598 critical value (1%)
5 -2.881829 critical value (5%)
6 -2.577589 critical value (10%)
total_due augmented dickey-fuller test:
Values Metric
0 -1.710329 Test Statistics
1 0.425798 p-value
2 12.000000 No. of lags used
3 144.000000 Number of observations used
4 -3.476598 critical value (1%)
5 -2.881829 critical value (5%)
6 -2.577589 critical value (10%)
# Perform first-order differencing to see if the series becomes stationary, and confirm with the Dickey-Fuller test
print('kwh_usage adf test on diff data:')
aug_dickey_fuller_test(group_df['kwh_usage'].diff().dropna())
print('\n')
print('total_due adf test on diff data:')
aug_dickey_fuller_test(group_df['total_due'].diff().dropna())
kwh_usage adf test on diff data:
Values Metric
0 -4.550132 Test Statistics
1 0.000159 p-value
2 11.000000 No. of lags used
3 144.000000 Number of observations used
4 -3.476598 critical value (1%)
5 -2.881829 critical value (5%)
6 -2.577589 critical value (10%)
total_due adf test on diff data:
Values Metric
0 -5.288116 Test Statistics
1 0.000006 p-value
2 11.000000 No. of lags used
3 144.000000 Number of observations used
4 -3.476598 critical value (1%)
5 -2.881829 critical value (5%)
6 -2.577589 critical value (10%)
First-order differencing made both kwh_usage and total_due stationary.
print('kwh_usage causes total_due?\n')
print('------------------')
granger_1 = grangercausalitytests(group_df[['kwh_usage', 'total_due']], 4)
print('\n total_due causes kwh_usage?\n')
print('------------------')
granger_2 = grangercausalitytests(group_df[['total_due', 'kwh_usage']], 4)
kwh_usage causes total_due?
------------------
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.4014  , p=0.5273  , df_denom=153, df_num=1
ssr based chi2 test:   chi2=0.4092  , p=0.5224  , df=1
likelihood ratio test: chi2=0.4087  , p=0.5226  , df=1
parameter F test:         F=0.4014  , p=0.5273  , df_denom=153, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=3.4939  , p=0.0329  , df_denom=150, df_num=2
ssr based chi2 test:   chi2=7.2208  , p=0.0270  , df=2
likelihood ratio test: chi2=7.0577  , p=0.0293  , df=2
parameter F test:         F=3.4939  , p=0.0329  , df_denom=150, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=2.2156  , p=0.0888  , df_denom=147, df_num=3
ssr based chi2 test:   chi2=6.9634  , p=0.0731  , df=3
likelihood ratio test: chi2=6.8106  , p=0.0782  , df=3
parameter F test:         F=2.2156  , p=0.0888  , df_denom=147, df_num=3

Granger Causality
number of lags (no zero) 4
ssr based F test:         F=0.9665  , p=0.4279  , df_denom=144, df_num=4
ssr based chi2 test:   chi2=4.1075  , p=0.3916  , df=4
likelihood ratio test: chi2=4.0534  , p=0.3988  , df=4
parameter F test:         F=0.9665  , p=0.4279  , df_denom=144, df_num=4

total_due causes kwh_usage?
------------------
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=2.1277  , p=0.1467  , df_denom=153, df_num=1
ssr based chi2 test:   chi2=2.1694  , p=0.1408  , df=1
likelihood ratio test: chi2=2.1545  , p=0.1422  , df=1
parameter F test:         F=2.1277  , p=0.1467  , df_denom=153, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=1.0387  , p=0.3564  , df_denom=150, df_num=2
ssr based chi2 test:   chi2=2.1467  , p=0.3419  , df=2
likelihood ratio test: chi2=2.1320  , p=0.3444  , df=2
parameter F test:         F=1.0387  , p=0.3564  , df_denom=150, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=1.7638  , p=0.1566  , df_denom=147, df_num=3
ssr based chi2 test:   chi2=5.5434  , p=0.1361  , df=3
likelihood ratio test: chi2=5.4460  , p=0.1419  , df=3
parameter F test:         F=1.7638  , p=0.1566  , df_denom=147, df_num=3

Granger Causality
number of lags (no zero) 4
ssr based F test:         F=1.4199  , p=0.2303  , df_denom=144, df_num=4
ssr based chi2 test:   chi2=6.0344  , p=0.1966  , df=4
likelihood ratio test: chi2=5.9185  , p=0.2053  , df=4
parameter F test:         F=1.4199  , p=0.2303  , df_denom=144, df_num=4
# week wise data split
train_data = group_df.loc[:160]
test_data = group_df.loc[160:]
model = VAR(train_data)
order = model.select_order(maxlags=20)
print(order.summary())
VAR Order Selection (* highlights the minimums)
==================================================
AIC BIC FPE HQIC
--------------------------------------------------
0 24.84 24.89 6.145e+10 24.86
1 24.17 24.31 3.142e+10 24.23
2 24.01 24.25 2.680e+10 24.11
3 23.91 24.24 2.412e+10 24.04
4 23.27 23.70* 1.273e+10 23.44
5 23.23 23.75 1.224e+10 23.44*
6 23.25 23.87 1.256e+10 23.50
7 23.25 23.96 1.250e+10 23.54
8 23.23 24.05 1.236e+10 23.56
9 23.15 24.06 1.139e+10 23.52
10 23.13 24.13 1.120e+10 23.54
11 23.17 24.26 1.163e+10 23.61
12 23.20 24.40 1.211e+10 23.69
13 23.01* 24.30 1.003e+10* 23.53
14 23.05 24.44 1.051e+10 23.62
15 23.03 24.51 1.031e+10 23.63
16 23.05 24.63 1.061e+10 23.69
17 23.12 24.79 1.140e+10 23.80
18 23.18 24.95 1.222e+10 23.90
19 23.19 25.05 1.241e+10 23.94
20 23.21 25.17 1.285e+10 24.01
--------------------------------------------------
Based on the above results, the minimum values of AIC and FPE are observed at lag 13, so the chosen lag is 13.
# As VARMAX handles stationarity via the enforce_stationarity property, we don't need to handle it explicitly
var_model = VARMAX(train_data, order=(13, 0), enforce_stationarity= True)
fitted_model = var_model.fit(disp=False)
print(fitted_model.summary())
Statespace Model Results
======================================================================================
Dep. Variable: ['total_due', 'kwh_usage'] No. Observations: 135
Model: VAR(13) Log Likelihood -1914.555
+ intercept AIC 3943.111
Date: Thu, 04 May 2023 BIC 4108.712
Time: 00:43:26 HQIC 4010.406
Sample: 0
- 135
Covariance Type: opg
===================================================================================
Ljung-Box (L1) (Q): 0.00, 0.54 Jarque-Bera (JB): 5.65, 149.67
Prob(Q): 0.98, 0.46 Prob(JB): 0.06, 0.00
Heteroskedasticity (H): 2.35, 0.84 Skew: 0.30, -0.58
Prob(H) (two-sided): 0.01, 0.55 Kurtosis: 3.80, 8.03
Results for equation total_due
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
intercept 98.7326 128.968 0.766 0.444 -154.039 351.505
L1.total_due 0.6894 0.556 1.241 0.215 -0.399 1.778
L1.kwh_usage -0.0415 0.053 -0.788 0.431 -0.145 0.062
L2.total_due 0.0015 0.541 0.003 0.998 -1.059 1.062
L2.kwh_usage 0.0169 0.050 0.340 0.734 -0.081 0.114
L3.total_due -0.2510 0.605 -0.415 0.678 -1.436 0.934
L3.kwh_usage 0.0284 0.060 0.470 0.638 -0.090 0.147
L4.total_due 0.2137 0.566 0.378 0.706 -0.895 1.323
L4.kwh_usage 0.0142 0.054 0.264 0.792 -0.091 0.119
L5.total_due -0.0781 0.653 -0.120 0.905 -1.358 1.202
L5.kwh_usage 0.0149 0.063 0.236 0.814 -0.109 0.139
L6.total_due 0.0738 0.456 0.162 0.871 -0.821 0.968
L6.kwh_usage -0.0207 0.047 -0.443 0.658 -0.112 0.071
L7.total_due 0.0086 0.659 0.013 0.990 -1.284 1.301
L7.kwh_usage -0.0069 0.065 -0.107 0.915 -0.134 0.120
L8.total_due -0.3686 0.581 -0.634 0.526 -1.508 0.771
L8.kwh_usage 0.0402 0.055 0.734 0.463 -0.067 0.148
L9.total_due 1.2441 0.542 2.295 0.022 0.182 2.307
L9.kwh_usage -0.1180 0.053 -2.246 0.025 -0.221 -0.015
L10.total_due -0.7691 0.524 -1.469 0.142 -1.795 0.257
L10.kwh_usage 0.0785 0.051 1.546 0.122 -0.021 0.178
L11.total_due -0.4792 0.643 -0.745 0.456 -1.740 0.782
L11.kwh_usage 0.0330 0.065 0.505 0.613 -0.095 0.161
L12.total_due 0.5531 0.710 0.779 0.436 -0.838 1.944
L12.kwh_usage -0.0732 0.071 -1.030 0.303 -0.212 0.066
L13.total_due 0.0447 0.538 0.083 0.934 -1.010 1.100
L13.kwh_usage 0.0303 0.051 0.597 0.551 -0.069 0.130
Results for equation kwh_usage
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
intercept 1314.7998 1459.380 0.901 0.368 -1545.532 4175.131
L1.total_due 3.1167 5.838 0.534 0.593 -8.325 14.559
L1.kwh_usage 0.0080 0.568 0.014 0.989 -1.106 1.122
L2.total_due -3.1502 6.096 -0.517 0.605 -15.098 8.797
L2.kwh_usage 0.4092 0.564 0.726 0.468 -0.696 1.515
L3.total_due -1.2417 6.897 -0.180 0.857 -14.760 12.276
L3.kwh_usage 0.1594 0.681 0.234 0.815 -1.175 1.494
L4.total_due -0.0446 6.340 -0.007 0.994 -12.470 12.381
L4.kwh_usage 0.4208 0.616 0.683 0.495 -0.787 1.629
L5.total_due -0.6764 6.770 -0.100 0.920 -13.946 12.593
L5.kwh_usage 0.1609 0.660 0.244 0.807 -1.132 1.454
L6.total_due 1.6659 5.630 0.296 0.767 -9.368 12.700
L6.kwh_usage -0.3140 0.553 -0.568 0.570 -1.398 0.770
L7.total_due -0.1344 6.859 -0.020 0.984 -13.577 13.308
L7.kwh_usage -0.0447 0.664 -0.067 0.946 -1.347 1.257
L8.total_due -4.1952 7.043 -0.596 0.551 -18.000 9.610
L8.kwh_usage 0.4330 0.635 0.682 0.495 -0.811 1.677
L9.total_due 11.3621 5.896 1.927 0.054 -0.194 22.918
L9.kwh_usage -1.0387 0.568 -1.827 0.068 -2.153 0.075
L10.total_due -8.8169 5.146 -1.713 0.087 -18.903 1.270
L10.kwh_usage 0.8543 0.502 1.702 0.089 -0.130 1.838
L11.total_due -1.3488 6.798 -0.198 0.843 -14.673 11.976
L11.kwh_usage -0.0325 0.690 -0.047 0.962 -1.385 1.320
L12.total_due 5.8202 7.748 0.751 0.453 -9.366 21.007
L12.kwh_usage -0.7153 0.759 -0.942 0.346 -2.203 0.773
L13.total_due -3.6598 5.381 -0.680 0.496 -14.206 6.887
L13.kwh_usage 0.6250 0.505 1.239 0.215 -0.364 1.614
Error covariance matrix
================================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------
sqrt.var.total_due 159.6700 16.314 9.787 0.000 127.695 191.645
sqrt.cov.total_due.kwh_usage 1627.4190 221.763 7.339 0.000 1192.772 2062.066
sqrt.var.kwh_usage 573.6064 55.653 10.307 0.000 464.528 682.685
================================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 8.18e+13. Standard errors may be unstable.
pred = fitted_model.get_prediction(start=len(train_data)-1,end=len(group_df)-1)
predictions = pred.predicted_mean
predictions.columns=['total_due', 'kwh_usage']
predictions.index = test_data.index
plt.figure(figsize=(20, 6))
plt.plot(train_data['kwh_usage'], label = 'Train kwh_usage')
plt.plot(train_data['total_due'], label = 'Train total_due')
plt.plot(test_data['kwh_usage'], label = 'Test kwh_usage')
plt.plot(test_data['total_due'], label = 'Test total_due')
plt.plot(predictions['kwh_usage'], label = 'Predicted kwh_usage')
plt.plot(predictions['total_due'], label = 'Predicted total_due')
plt.legend(loc='upper left')
plt.title('Weekly Electricity Usage Forecasting', fontsize=20)
plt.xlabel('Week', fontsize=12)
plt.ylabel('Average Energy Usage Consumption in Kwh', fontsize=12)
plt.tight_layout()
# Calculating the root mean squared error
rmse_kwh_usage = math.sqrt(mean_squared_error(predictions['kwh_usage'],test_data['kwh_usage']))
print('Mean value of kwh_usage is : {}. Root Mean Squared Error is :{}'.format(mean(test_data['kwh_usage']), rmse_kwh_usage))
rmse_total_due = math.sqrt(mean_squared_error(predictions['total_due'],test_data['total_due']))
print('Mean value of total_due is : {}. Root Mean Squared Error is :{}'.format(mean(test_data['total_due']), rmse_total_due))
Mean value of kwh_usage is : 7270.807960167654. Root Mean Squared Error is :2706.3394255892267
Mean value of total_due is : 551.2324396222277. Root Mean Squared Error is :169.7645963206738
lstm_df = data_df[['kwh_usage', 'week']]
lstm_df = lstm_df.groupby('week').mean()
lstm_df.shape, lstm_df.head()
((157, 1),
kwh_usage
week
26 2550.591241
27 4920.168111
28 4767.283350
29 4825.196879
30 6660.206659)
res = seasonal_decompose(lstm_df['kwh_usage'].dropna(), period=1)
fig = res.plot()
fig.set_size_inches((11, 7))
fig.tight_layout()
plt.show()
train_data = lstm_df.loc[:160]
test_data = lstm_df.loc[160:]
lstm_df.head(2), lstm_df.tail(2)
( kwh_usage
week
26 2550.591241
27 4920.168111,
kwh_usage
week
181 5154.602378
182 16454.560694)
scaler = MinMaxScaler()
scaler.fit(train_data)
scaled_train = scaler.transform(train_data)
scaled_test = scaler.transform(test_data)
# define generator - using past 15 values
n_input = 15
n_features = 1
generator = TimeseriesGenerator(scaled_train, scaled_train, length=n_input, batch_size=1)
X,y = generator[0]
print(f'Given Array: \n{X.flatten()}')
print(f'Need to Predict (y): \n {y}')
Given Array: 
[0.05726747 0.25047208 0.23800654 0.24272855 0.39234697 0.19474895 0.1703378 0.27122726 0.51728633 0.20947108 0.1492983 0.16804037 0.24900017 0.30026072 0.12396093]
Need to Predict (y): 
[[0.23582726]]
# define model
model = Sequential()
model.add(LSTM(100, activation='relu', input_shape=(n_input, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
# fit model
model.fit(generator, epochs=50)
Epoch 1/50 120/120 [==============================] - 4s 13ms/step - loss: 0.0522 Epoch 2/50 120/120 [==============================] - 1s 9ms/step - loss: 0.0449 Epoch 3/50 120/120 [==============================] - 1s 8ms/step - loss: 0.0439 Epoch 4/50 120/120 [==============================] - 1s 8ms/step - loss: 0.0411 Epoch 5/50 120/120 [==============================] - 1s 12ms/step - loss: 0.0425 Epoch 6/50 120/120 [==============================] - 1s 12ms/step - loss: 0.0411 Epoch 7/50 120/120 [==============================] - 1s 7ms/step - loss: 0.0427 Epoch 8/50 120/120 [==============================] - 1s 8ms/step - loss: 0.0415 Epoch 9/50 120/120 [==============================] - 1s 7ms/step - loss: 0.0401 Epoch 10/50 120/120 [==============================] - 1s 8ms/step - loss: 0.0399 Epoch 11/50 120/120 [==============================] - 1s 7ms/step - loss: 0.0402 Epoch 12/50 120/120 [==============================] - 1s 7ms/step - loss: 0.0396 Epoch 13/50 120/120 [==============================] - 1s 7ms/step - loss: 0.0392 Epoch 14/50 120/120 [==============================] - 1s 9ms/step - loss: 0.0364 Epoch 15/50 120/120 [==============================] - 2s 12ms/step - loss: 0.0372 Epoch 16/50 120/120 [==============================] - 1s 9ms/step - loss: 0.0347 Epoch 17/50 120/120 [==============================] - 1s 8ms/step - loss: 0.0346 Epoch 18/50 120/120 [==============================] - 1s 8ms/step - loss: 0.0328 Epoch 19/50 120/120 [==============================] - 1s 8ms/step - loss: 0.0328 Epoch 20/50 120/120 [==============================] - 1s 8ms/step - loss: 0.0313 Epoch 21/50 120/120 [==============================] - 1s 8ms/step - loss: 0.0318 Epoch 22/50 120/120 [==============================] - 1s 8ms/step - loss: 0.0331 Epoch 23/50 120/120 [==============================] - 1s 8ms/step - loss: 0.0290 Epoch 24/50 120/120 [==============================] - 2s 13ms/step - loss: 0.0300 Epoch 25/50 120/120 [==============================] - 2s 18ms/step - loss: 0.0298 Epoch 26/50 120/120 [==============================] - 2s 14ms/step - loss: 0.0298 Epoch 27/50 120/120 [==============================] - 2s 13ms/step - loss: 0.0287 Epoch 28/50 120/120 [==============================] - 2s 13ms/step - loss: 0.0262 Epoch 29/50 120/120 [==============================] - 1s 11ms/step - loss: 0.0286 Epoch 30/50 120/120 [==============================] - 1s 12ms/step - loss: 0.0253 Epoch 31/50 120/120 [==============================] - 2s 17ms/step - loss: 0.0276 Epoch 32/50 120/120 [==============================] - 2s 16ms/step - loss: 0.0263 Epoch 33/50 120/120 [==============================] - 2s 13ms/step - loss: 0.0283 Epoch 34/50 120/120 [==============================] - 2s 20ms/step - loss: 0.0237 Epoch 35/50 120/120 [==============================] - 2s 14ms/step - loss: 0.0271 Epoch 36/50 120/120 [==============================] - 2s 16ms/step - loss: 0.0244 Epoch 37/50 120/120 [==============================] - 2s 16ms/step - loss: 0.0243 Epoch 38/50 120/120 [==============================] - 2s 18ms/step - loss: 0.0235 Epoch 39/50 120/120 [==============================] - 2s 17ms/step - loss: 0.0225 Epoch 40/50 120/120 [==============================] - 2s 15ms/step - loss: 0.0209 Epoch 41/50 120/120 [==============================] - 2s 14ms/step - loss: 0.0197 Epoch 42/50 120/120 [==============================] - 1s 8ms/step - loss: 0.0219 Epoch 43/50 120/120 [==============================] - 1s 8ms/step - loss: 0.0209 Epoch 44/50 
120/120 [==============================] - 1s 8ms/step - loss: 0.0205 Epoch 45/50 120/120 [==============================] - 1s 8ms/step - loss: 0.0190 Epoch 46/50 120/120 [==============================] - 1s 11ms/step - loss: 0.0193 Epoch 47/50 120/120 [==============================] - 2s 17ms/step - loss: 0.0221 Epoch 48/50 120/120 [==============================] - 2s 15ms/step - loss: 0.0189 Epoch 49/50 120/120 [==============================] - 2s 17ms/step - loss: 0.0203 Epoch 50/50 120/120 [==============================] - 2s 14ms/step - loss: 0.0193
<keras.callbacks.History at 0x7f04d02aac20>
loss_per_epoch = model.history.history['loss']
plt.plot(range(len(loss_per_epoch)),loss_per_epoch)
plt.title('Plot of LSTM Training Loss vs Epoch')
plt.xlabel('Epoch')
plt.ylabel('Loss')
Text(0, 0.5, 'Loss')
test_preds = []
first_eval_batch = scaled_train[-n_input:]
current_batch = first_eval_batch.reshape((1, n_input, n_features))
for i in range(len(test_data)):
# get pred value for 1st batch
current_pred = model.predict(current_batch)[0]
# add preds into test_predictions[]
test_preds.append(current_pred)
# use pred to update the batch and remove 1st value
current_batch = np.append(current_batch[:,1:,:],[[current_pred]],axis=1)
true_predictions = scaler.inverse_transform(test_preds)
test_data['predictions'] = true_predictions
plt.figure(figsize=(20, 6))
plt.plot(train_data, label='Train kwh_usage')
plt.plot(test_data['kwh_usage'], label='Test kwh_usage')
plt.plot(test_data['predictions'], label='Predicted kwh_usage')
plt.legend(loc='upper left')
plt.title('Weekly Average Electricity Consumption Forecast')
plt.xlabel('Week')
plt.ylabel('Average Consumption Usage in Kwh')
plt.show()
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse=sqrt(mean_squared_error(test_data['kwh_usage'],test_data['predictions']))
print('Root Mean Squared Error of kwh_usage is :{}'.format(rmse))
Root Mean Squared Error of kwh_usage is :2706.4503345080548
Models proposed: KMeans, DBSCAN, and Agglomerative Clustering, used to group service locations by their usage patterns.
data_main = pd.read_csv('Electricity_Usage_Data.csv')
data_main[['bill_date']] = data_main[['bill_date']].apply(pd.to_datetime)
data_main.loc[:,'bill_date'] = data_main['bill_date'].apply(lambda x: pd.to_datetime(f'{x.year}-{x.month}-01'))
address_enc = LabelEncoder()
bill_type_enc = LabelEncoder()
data_main['address_enc'] = address_enc.fit_transform(data_main['service_address'])
data_main['bill_type_enc'] = bill_type_enc.fit_transform(data_main['bill_type'])
data_main['year'] = data_main['bill_date'].apply(lambda x:x.year)
data_main["week"] = data_main.apply(lambda row: row["bill_date"].week+52*(row["year"]-2011) ,axis=1)
data_main['month'] = data_main['bill_date'].apply(lambda x:x.month)
## Loading the extracted geolocations from saved pickle file
if os.path.isfile("locations.pkl"):
locations = pickle.load(open("locations.pkl","rb"))
print("Total Geo Location data extracted by Addresses : ",len(locations.keys()))
else:
print("Locations not founds, run Extract Geolocations.ipynb")
Total Geo Location data extracted by Addresses : 3442
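The geocoding itself is done in Extract Geolocations.ipynb, which is not reproduced here. A minimal sketch of how such an address-to-(lat, lon) lookup could be built, assuming the geopy package and its Nominatim geocoder (the actual notebook may use a different service), is:
# Hypothetical geocoding sketch; Extract Geolocations.ipynb may differ
from geopy.geocoders import Nominatim
import pickle, time

geolocator = Nominatim(user_agent='houston-electricity-project')
locations = {'Houston': ('29.7589382', '-95.3676974')}  # city-centre fallback used below
for addr in data_main['service_address'].unique():
    loc = geolocator.geocode(f'{addr}, Houston, TX')
    if loc is not None:
        locations[addr] = (str(loc.latitude), str(loc.longitude))
    time.sleep(1)  # respect the Nominatim rate limit; ~3,400 addresses takes about an hour
pickle.dump(locations, open('locations.pkl', 'wb'))
Coordinates are stored as strings because the fallback filter further down compares lat/lon against string values.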
data_main["lat"] = data_main["service_address"].apply(lambda x : locations[x][0] \
if x in locations.keys() else locations["Houston"][0])
data_main["lon"] = data_main["service_address"].apply(lambda x : locations[x][1] \
if x in locations.keys() else locations["Houston"][1])
## Remove the rows for which geolocations were not extracted and were therefore assigned the default Houston city location
data_main = data_main[(data_main["lat"]!='29.7589382')&(data_main["lon"]!='-95.3676974')]
X = data_main[[
'business_area',
'address_enc',
'bill_type_enc',
'year',
'month',
'week',
'kwh_usage',
]]
X.dtypes
business_area int64 address_enc int32 bill_type_enc int32 year int64 month int64 week int64 kwh_usage float64 dtype: object
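Note that these features sit on very different scales (kwh_usage in the thousands, month between 1 and 12), so Euclidean distances in KMeans are dominated by kwh_usage. A standardized variant is sketched here for comparison; the elbow analysis below runs on the raw values, as in the original:
# Sketch: standardize the features so no single column dominates the distance metric
std_scaler = StandardScaler()
X_scaled = std_scaler.fit_transform(X)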
def calculate_WSS(points, kmax):
    # Within-cluster sum of squares (KMeans inertia) for k = 10, 20, ..., kmax
    sse = []
    for k in range(10, kmax + 1, 10):
        kmeans = KMeans(n_clusters=k).fit(points)
        sse.append(kmeans.inertia_)
        print("sse for kmeans with {} clusters is : {}".format(k, kmeans.inertia_))
    return sse
sse = calculate_WSS(X.values,100)
sse for kmeans with 10 clusters is : 58267410681664.32
sse for kmeans with 20 clusters is : 12944199129004.955
sse for kmeans with 30 clusters is : 6020451394723.76
sse for kmeans with 40 clusters is : 3440913598796.767
sse for kmeans with 50 clusters is : 2214659637984.169
sse for kmeans with 60 clusters is : 1535920912851.7847
sse for kmeans with 70 clusters is : 1095696134122.1945
sse for kmeans with 80 clusters is : 808177861402.0115
sse for kmeans with 90 clusters is : 638014545319.9745
sse for kmeans with 100 clusters is : 526548423028.7805
plt.plot([i for i in range(10,101,10)],sse)
plt.xlabel("Number of clusters")
plt.ylabel("Sum Squared Error")
plt.show()
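As a complement to the elbow curve, the silhouette score can be computed for a few candidate k values. A sketch on a subsample (the silhouette computation is quadratic in the number of points; this was not part of the original selection):
from sklearn.metrics import silhouette_score
# Evaluate a handful of candidate cluster counts on a 5,000-row sample
sample = X.sample(5000, random_state=0).values
for k in (20, 30, 40, 50):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(sample)
    print('silhouette score for k={}: {:.3f}'.format(k, silhouette_score(sample, labels)))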
## Training KMeans with the chosen k value (k=40, based on the elbow curve above)
kmeans = KMeans(n_clusters=40, random_state=0).fit(X)
print("Value Counts of Cluster obtained from kmeans : ",pd.Series(kmeans.labels_).value_counts())
Value Counts of Cluster obtained from kmeans :
27    47879
0     41440
18     6954
32     1815
10      943
38      663
6       340
19      322
29      268
13      184
3       111
34      102
12       75
33       69
30       60
7        38
21       36
16       21
26       21
9        21
17       20
14       19
2        18
25       16
31       13
39       13
23       10
8        10
24       10
35        9
1         8
5         7
11        7
28        5
22        5
20        4
37        4
4         4
15        3
36        1
dtype: int64
data_main["Cluster"] = kmeans.labels_
data_main["lat"].value_counts()
28.4220756 628
29.6432364 315
34.901468 306
33.727672 300
29.681557 281
...
29.72712524161074 1
29.783487 1
29.809849333333332 1
29.840447 1
29.946153 1
Name: lat, Length: 3007, dtype: int64
import plotly.express as px

fig = px.scatter_mapbox(data_main, lat="lat", lon="lon", hover_name="service_address", \
hover_data=["service_address","Cluster","kwh_usage"],color="Cluster", \
color_discrete_sequence=["fuchsia"], zoom=3, height=300, \
center=dict(lat=29.7, lon=-95.3),)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
display.Image("Kmeans2.png")
db = DBSCAN(eps=0.3, min_samples=3)
db.fit(X)
y_pred = db.fit_predict(X)
print("Number of Clusters Obtained from DBSCAN : ",len(set(y_pred)))
Number of Clusters Obtained from DBSCAN : 2
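With eps=0.3 on unscaled features, DBSCAN lumps nearly everything into a single cluster plus noise. A common way to pick a better eps is the k-distance plot: sort every point's distance to its min_samples-th nearest neighbour and look for the knee. A diagnostic sketch (not used in the original run):
from sklearn.neighbors import NearestNeighbors
# Distance of each point to its 3rd-nearest other point (n_neighbors=4 because the query point is returned as its own nearest neighbour)
nn = NearestNeighbors(n_neighbors=4).fit(X)
distances, _ = nn.kneighbors(X)
plt.plot(np.sort(distances[:, -1]))
plt.xlabel('Points sorted by distance')
plt.ylabel('Distance to 3rd nearest neighbour')
plt.title('k-distance plot for choosing DBSCAN eps')
plt.show()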
data_main["DBS Cluster"] = y_pred
fig = px.scatter_mapbox(data_main, lat="lat", lon="lon", hover_name="service_address", \
hover_data=["service_address","DBS Cluster","kwh_usage"], \
color="DBS Cluster",color_discrete_sequence=["fuchsia"], \
zoom=3, height=300,center=dict(lat=29.7, lon=-95.3),)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
display.Image("DBSCAN.png")
pca = PCA(n_components = 2)
X_principal = pca.fit_transform(X)
X_principal = pd.DataFrame(X_principal)
X_principal.columns = ['P1', 'P2']
X_principal.shape
(101548, 2)
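Before reading too much into the dendrogram below, it is worth checking how much variance the two principal components retain; with unscaled inputs the kwh_usage column will dominate the first component. A quick sketch:
# Fraction of total variance captured by each of the two principal components
print(pca.explained_variance_ratio_)
print('total retained:', pca.explained_variance_ratio_.sum())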
plt.figure(figsize =(8, 8))
plt.title('Visualising the data')
Dendrogram = shc.dendrogram((shc.linkage(X_principal.iloc[:300], method ='ward')))
## Reduce the number of rows, as agglomerative clustering does not scale to the full dataset
num_points = 10000
clustering = AgglomerativeClustering(n_clusters=2).fit(X.iloc[:num_points])
data_aggc = data_main.iloc[:num_points].copy()  # .copy() avoids the SettingWithCopyWarning
data_aggc["Aggc Cluster"] = clustering.labels_
fig = px.scatter_mapbox(data_aggc, lat="lat", lon="lon", hover_name="service_address", \
hover_data=["service_address","Aggc Cluster","kwh_usage"], \
color="Aggc Cluster", color_discrete_sequence=["fuchsia"], \
zoom=3, height=300,center=dict(lat=29.7, lon=-95.3),)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
display.Image("Aggc.png")
The hardest part of the project so far has been the initial data cleaning and making sense of the columns. The correlation between the features is still quite low, so generating useful insights from these features is proving challenging.
Energy usage is noticeably higher during the summer and fall months, and a few business areas consume considerably more energy than the rest; both patterns have been visualized above.
The results so far are preliminary. We need to iterate further on these results and observations in order to verify the conclusions before we can confidently say that they are correct.
The data might prove insufficient for a few of the ML tasks we are planning to perform, and finding another data source that could be combined with these datasets may prove difficult if the need arises.
Yes, we believe we are on track with the project. We plan to work on the models that have not yet been implemented, and we will verify the work done so far before drawing final conclusions.
The data as it stands has been good enough to support the insights presented so far. Once the time-series and clustering analyses are complete, we expect to show that the data also supports forecasting and can group high-energy areas into clusters.
Performed time-series analysis using three models: ARIMA, VAR, and LSTM.
Of the three clustering models, KMeans performed best: it clearly differentiated the clusters and allowed us to draw some conclusions from the geo-plotted clusters.
Since we are a team of 4 graduate students, we had to provide 12 (3N) deliverables. They are: